Google Summer of Code 2020 improved automatic maintenance of conda-forge

Personal Details

Name: Vinicius Douglas Cerutti

University: Federal University of Santa Catarina (Universidade Federal de Santa Catarina)

Email: connect with me with Linkedin or Gitter

Country of residence: Brazil

GitHub: https://github.com/viniciusdc

Linkedin: https://www.linkedin.com/in/vinicius-douglas-530202142/

Technical experience

I’m an undergraduate student in mathematics and computer science, 4th year, at the federal university of Santa Catarina, and my main work and future development will be focused in Optimization and data structure development. My interest areas are Distance Geometry, Nonlinear dynamic systems, artificial intelligence and Computational Neuroscience.

I Am currently working with my thesis project, that consists of a geometry distance problem which aims to determine a molecular conformation based on a guess on its intermolecular distances. The method involves a grand number of concepts, such as optimization, gradient-like methods, Barzilai and Borwein espectral steps, nonmonotone line search and semidefinite optimization. Some programming languages that I have previously worked include, Java, C, Python and Matlab, with total focus on Python.

Development Experience

Even though this is my first experience with a community project, and working in a globally setup. I’ve already experienced a lot, from the github comunitys as a student in Non-linear programming and scientific computation as one of my applied mathematics subjects.

As part of my, “scientific initiation” (a project of my university in coordination with Coordination for the Improvement of Higher Education Personnel) I’ve already worked with many articles and published sources for my thesis project, so my actual understanding of research and formal writing is well developed.

Why this project?

My main motivation is the ability to integrate myself in a new environment as a community and work with everyone to involve new abilities and share knowledge on a global scale. The second one, it’s exactly the community possibilities, not I will be using and working directly with the languages that I know, but I will get in touch with new possibilities that will definitely increase my potiencil and allow me to improve myself to help the project even more.

Another important feature, that makes me choose this project it’s because all the community is very receptive and the project main goal and its structure, the bot especially, call my curiosity to get an understanding of its affiliated process and how to improve it even more, cause its principal service works daily with a big amount of data, so any new improvement it’s a big step for future development.

Project

Project Abstract

Conda-forge, an umbrella project of NumFOCUS, is a community effort that provides conda packages for a wide range of software. The conda forge “auto-tick” bot plays an important part on the automatic maintenance of conda-forge packages. This project aims at the improvement of the conda-forge “auto-tick” bot structure, more specifically, separating the version updates structure to its own microservice, creating a more organic environment for the bot, which in return will simplify the process of its migration to other data structures like Dynamodb.

Technical Details

As cited in the ideas page of conda-forge, the auto tick bot enables automatic maintenance of conda-forge packages by pushing version updates to the underlying software and enabling large migrations of packages from one dependency to another. With everything occurring in the same data structure.

This structure is called graph which nodes represents the packages, that are currently in our feedstocks, i.e, a repository with the conda recipes and continuous integration configuration needed for build, test, and deployment of packages. As an simple example of the graph structure

{"graph":{...}, "links":[{"source":..., "target":...}], "nodes":[{"id":..., "payloads":...}]}

where it respective payload is a LazyJson with current information about the package, like

PRed bad conda-forge.yml feedstock_name hash_type meta_yaml name new_version pinning_version raw_meta_yaml req requirements smithy_version strong_exports total_requirements url version

Currently when a package passes by the automated bot, some information is checked by calling other microservices, one of these is the update_upstream_version function, its main goal is the requisition of every new version for the packages we currently have, getting this information using the sources that are provided for each package in our feedstock.

But, unfortunately the current environment for this function is defined inside the main structure of the graph, i.e, the updates occur inside the current graph. So, the the fundamental idea for this project is to move the data structures used to update the packages to another repo with a separate process, not directly interfering with the graph and the others bot microservices.

Proposal Timeline.

Community Bonding Period - Before March

To familiarize myself with the structure of conda-forge community and it’s bots infrastructure.
Study of version control facilities and linux based shells for commits in the master repo, and JSON and YAML file structures with python.
Create an standard worksetup for conda-forge repositories, and inciate tests with coda-forge auto-tick bot.
With the help of my mentors I will become more clear about the bot infrastructure and my future goals during the implementation of the new microservice.
Start digging for issues, and do some attempts for the new migration code.
Created a blog, in Github, for communication and content share https://viniciusdc.github.io/

Phase 1.1 - March until the April

Start new tests with the new structure and formalize the code of the new version function.
During this period I will remain in touch with my mentors and the Conda-forge community. I will remain active on the Gitter to discuss, and GitHub issues pages, to finalize the modifications on the new repository setup and migration schema.
The goal for this fase, is to finish the code that writes the new versions to a separate data structure outside the graph environment. Which can be seen on the issues page of conda-forge cf-script page on GitHub pull#907, or on my blog.
Once the code is finished, start the work with the output file format, our goal is to prepare it as a Dynamodb data file.
Scheduling the reunion dates, and report schemes for the program.

Phase 1.2 - April until the coding period starts

Continue the previous work, and with the help of my mentors start the schemas for the new service.
Studying a little bit more about Dynamodb, and start the tests with the version update function and its communication with the bot services.
Full integration with the new data structures and merge with Dynamodb, this is, refactoring the graph data structure and merge the new update versions microservice within, the version function now works independently with the graph.
Once that works, finish the adjustments for possible errors and bugs, and prepare the corresponding documentation for the new service (e.g there is a problem when running it directly from a Windows enviroment).
Preparation for the final merge, and a systematic analysis for eventual time consuming operations.
Set a Communication line between conda-forge community and Google summer of code, to release the current work as weak reports. Probably using the blog to update information.

Phase 2 - Oficial coding period starts

Define all the required Relations, and packages needed, in my local repository, and upgrade my actual work setup if needed.
Define all the corresponding Python Classes and Objects that will be used, as PyPy, CRAN and others sources, modify the codes presented in cf-scripts repo as the update functions main structure, more specifically, deleting the old service and prepare all the necessary archives for a new upgrade.
Define all the interactions that can be possible arranged with the bot main service and the new update structure, as a synopsis of how the new microservice is working.
Once all the work is done, this is, all information relating the packages upgrade system, by the conda-forge bot, is arranged, we are gonna start completely refactoring the cf-scripts-update-function to work independently from the graph structure, using the previous work on the update-version function and possibly begin the test for allocation on Dynamodb.

This will help in testing the proper working of the entire basic code changes that we will later on incorporate with the Dynamodb (new tables) and auto-tick bot recovery service.

Testing the overall working of each and every new code added to new service, maybe create a test option, to run together with the bot, to see if the system cooperates and the updates are happening as expected.
- We can create a dir, inside dynamodb specifically for that, with some packages examplesmcurrently being used and some “bad exceptions”.
Try a new idea for the packages organizations as a goal for efficiency when retrieving the new version, more accuracy and velocity. This idea is related to how the bot reads the information relating to each package version, there are two main ideas here:
- Allocating each package by it source, facilitating the get_links function work;
- Or, separating each package by subgroups, by lexicographic order, during the new version request to reduce the time spent in each iteration of “for”.

Phase 3 - Final

Now enters the most critical phase, the coordination between the new version requests, new update structure, with the bot other services, as bot works directly with graph dependencies which now is a separate service from main. It’s important to receive the new version of the packages stored in Dynamodb with the current service of bot, to not lose any information during the process.
Again, the necessity for a test option, to obtain more information about the communication between the processes.
If possible, handling the issues for the automatic PR for new packages.
Now, inspect the code for any other possible issue, and formatting to final merge.
Inspect the documentations, and add information about the migration process and a detailed explanation of the bot services.
Perfect, now that everything is working, begin the preparation for the last report for the Google Summer of code, with the completed code and documentation, as a simple article reviewing all the new feature and searches made. Completing with a general list of possible improvements.

Extra content

As an extra work, if possible, try to merge this microservice in a new subprocess for the bot, to separate the current IO structure of the bot main computation. In this case, as we are currently working on a communication between the graph and the data stored in Dynamodb (the update structure) we can use this opportunity to increase the efficiency for the current feedstocks pulling information scheme, maybe storing it in this new database.