This is crossposted from Curiousity.ca, my personal maker blog. If you want to link to this post, please use the original link since the formatting there is usually better.


If you just want the answer to “where do I find a reliable global mirror of NVD vulnerability data?” or “where should I get a list of CVEs if NVD is down?” the answer is https://cveb.in/ . It is being mirrored on the same servers used for major open source projects, so you’re probably already trusting them, and they should be fast and may be very close to you. Please go ahead and use it and let us know how it works for you!





I co-presented a talk about this work at BSidesPDX on Saturday, October 26, 2024:









Often when I write about talks I’ve given, I try to kind of recreate them in blog posts to be a bit of a director’s cut where I add in a bit of extra material that didn’t make the talk, but they’re pretty similar to what I said on stage. This time, though, since I didn’t give the second half of the talk and John and I have very different ways of telling a story, I’m just gonna tell a story in this blog post and maybe toss in a few slides. If you want to watch us both tell the story from our own perspectives, check out the video. Although we collaborate on a lot of stuff, it’s surprisingly rare for us to share a stage.





Still here and not going with the recording? Okay, let me tell you a story…





The US government DDOSed itself





Once upon a time, not so long ago, the US government decided it wanted to raise the bar for software security in their supply chain, and they wrote up an executive order on cybersecurity explaining how they wanted suppliers to do better, including a section on not shipping software with known vulnerabilities. Many other groups followed suit with similar recommendations or requirements.





As a result, a lot of organizations’ security plans started to look a lot like this:





Image Description: A diagram from my talk about NVD mirroring. The top of the slide is labelled “2024 Corporate Security Policy_final_FINAL.doc” (which is a joke about filenames for things that undergo a lot of revisions). There are then three columns. The first is labelled Step 1 and there is text in a red box that reads “Scan components for vulnerabilities.” Step 2 has an orange box which contains the text, “???” and Step 3 has two green boxes, one of which says “EVERYTHING IS SECURE” and has a picture of a closed lock. The second reads, “$$ Profit $$”




There is a lot to say about steps 2 and 3 here, but our problem starts at the beginning of Step 1. To scan for vulnerabilities, you need a list of software you’re providing (which is a whole talk in and of itself) and a list of known software vulnerabilities.





One of the biggest sources of vulnerability data actually comes from the US government: the NVD (National Vulnerability Database) provided by NIST (National Institute of Standards and Technology). It’s pretty great — they provide it fully free, publicly licensed. This is usually where you go to get information about CVEs (Common Vulnerabilities and Exposures).





But what do you think happens if every single US government supplier and indeed, many other software companies around the world, all try to grab this data at once? And more than that, many of them start enabling regular scanning so they’re grabbing it multiple times per day, or per hour?





Image Description: A slide from my BSidesPDX 2024 talk which reads “Distributed Denial of Service” and has photo I took of some street signs near the train tracks. The relevant one is a large yellow caution sign that shows a person with a bike getting a wheel stuck in the train tracks and the rider is being launched off the bike over the tracks.




So, yeah, the US government kind of started a denial of service attack against its own agency. And in case that wasn’t bad enough, we started seeing headlines like “NIST Struggles with NVD Backlog as 93% of Flaws Remain Unanalyzed” where the stories talked about funding cuts at NIST.





The fine folk at NIST have been doing a hard job with not enough resources and some really unfortunate timing, so they’d already been working on keeping things from being overwhelmed. They had introduced rate limits per IP address/API key to keep rogue scanning jobs from ruining things for everyone, and they had started providing an API that allowed people to get just the newest data instead of having to download everything every time. Unfortunately, the API combined with rate limits was pretty slow, so getting the full database for the first time via the API was onerous when it worked at all. Several of my colleagues in the UK and in India had such long delays that they had to give up and bootstrap the “old” way to get started. And a lot of people were running their scanning within ephemeral containers and just didn’t cache the copy of the database at all, so they ended up pulling all the data fresh with each new scan. When neither the rate limits nor the API was enough to address demand longer-term, and with budget cuts on the horizon, NIST turned to looking for industry partnerships and additional funding.
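To give you a sense of what “grab just the newest data” looks like in practice, here’s a minimal sketch of an incremental pull against the NVD 2.0 API with a naive back-off. This is illustrative only (it is not cve-bin-tool’s actual code), and the sleep values are guesses you’d want to tune against whatever rate limits NVD currently publishes:

```python
# Minimal sketch of an incremental NVD API 2.0 pull with a naive back-off.
# Not cve-bin-tool's real implementation -- just an illustration of why
# "API + rate limits" is slow when you need to bootstrap a full copy.
import time
import requests

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def fetch_recent_cves(api_key, since_iso, until_iso):
    """Yield CVE records modified between two ISO-8601 timestamps."""
    start_index = 0
    while True:
        resp = requests.get(
            NVD_API,
            params={
                "lastModStartDate": since_iso,
                "lastModEndDate": until_iso,
                "startIndex": start_index,
                "resultsPerPage": 2000,  # documented per-page maximum
            },
            headers={"apiKey": api_key},
            timeout=60,
        )
        if resp.status_code in (403, 503):  # rate limited or overloaded
            time.sleep(30)                  # back off, then retry the same page
            continue
        resp.raise_for_status()
        data = resp.json()
        yield from data.get("vulnerabilities", [])
        start_index += data.get("resultsPerPage", 0)
        if start_index >= data.get("totalResults", 0):
            break
        time.sleep(6)  # stay politely under the per-key rate limit
```

Multiply that polite little loop by every CI job on the planet and you can see both why the rate limits exist and why a full bootstrap this way takes so long.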





It was clear that this wasn’t a problem that was getting solved quickly.





That sounds bad, but how is that YOUR problem, Terri?





Why did I care? I mean, obviously I’m a security professional and things that stand in the way of good security choices are a problem for me in general. But in this case, my work open source project involves building a vulnerability scanner called cve-bin-tool: https://pypi.org/project/cve-bin-tool . It’s a free, open source software vulnerability scanner for binary files, git repos, and SBOMs.





(Quick reminder: This is my personal blog and as such, all opinions here are my own and do not necessarily reflect those of my employer.)





In the course of developing software to scan for vulnerabilities, we’d gotten a front row seat to all of the NVD changes: we’d had to start using API Keys and explaining them to our users, we’d had to handle new timeout messages and do appropriate backoffs and rate limits, and we’d started getting reports from users that updates were slow or not working. Many users and contributors located outside of the US were experiencing extensive delays.





Following NVD best practice had been making our code more complex, our software harder to use, and our users unhappy. It’s hard enough to get software developers to care about vulnerabilities, and something that had previously been pretty easy to install and try was getting uncomfortably hard. But while we supported other data sources with vulnerability data, NVD was still the biggest one and the one people wanted the most.





How do we make vulnerability data available to everyone?





We probably could have solved the problem for cve-bin-tool the same way commercial entities have handled it: make our own copy, query that, keep it updated separately. They often add proprietary data (such as the missing triage of new vulnerabilities) and then sell access to that data as part of their solution. We were already keeping a local copy of the data in GitHub so our CI jobs would quit timing out at inopportune moments. But my goal has long been to make software more secure for everyone. What if we thought bigger than one Python application? What if I built a solution that would help the whole world?





Image description: A slide from my BSidesPDX 2024 talk. On one side, it reads “what if we helped the *world* get vulnerability data?” and on the other side it has a screenshot of a tumblr post. The first post is from writing-prompt-s and reads “In a game with no consequences, why are you still playing the ‘Good’ side?”. The next post is from raphaeliscoolbutrude and says “Because being mean makes me feel bad.” The final post is from user deflare and reads, “Because my no-consequences power fantasy is *being able to help everyone*”




It might have been easy to lay a lot of the blame on people using “ephemeral” continuous integration jobs. They typically grab a mostly empty linux image, install/update some software, download the thing they want to scan, download the vulnerability data, store a report somewhere, then throw the rest of the thing away to start fresh next time. If they just cached the data instead of grabbing it every single time, we wouldn’t be in this mess.





But we could learn from what they were doing too: it was perfectly viable for them to download entire software binaries every single time, and no one batted an eye at that. Why was it easier and faster to get the software than to get metadata about the software? The answer, of course, is that we weren’t all trying to download from a single underfunded government agency. Instead, we were downloading from… a bunch of underfunded open source hippies? How was that working when the government servers weren’t?





I am old enough that I knew the answer. Open source had solved its distribution problem by asking people to store a “mirror” (a copy of all the files) on their own servers, then building infrastructure to help people find the one closest to them. It all happened long before anyone had coined the term “cloud service provider”, and it had happened on shoestring budgets with people donating a bit of space in a server rack and a bit of bandwidth. A lot of early mirrors were in universities or small internet service providers who had an open source enthusiast on staff. Get enough of them, and suddenly everyone gets software and no one gets stuck with a giant bill or an overloaded server.





It looked like neither government nor industry was going to solve this problem on the timeline I wanted and maybe never on the global scale that would make my life easier. But I have access to resources that a government agency maybe doesn’t: I know where one of the world’s leading experts on open source mirroring lives. It’s in my house. Because I married him. As well as having years of experience in multiple roles, he’s actively involved in running one of the larger open source content distribution networks in the world. So I had access to exactly what I needed to help everyone. I walked upstairs and said, “Hey John, if I wanted to mirror the NVD data on the micro mirrors, could we do that?” and then we figured out how to make it happen.





FCIX Micro Mirrors





This is the point at which I handed the talk over to John. But here’s my truncated version of his half of the story.





John builds infrastructure the way I knit: compulsively and constantly. And when he’s not actually doing something with his infrastructure there’s a good chance he’s thinking about it or talking about it. He hosts people’s websites and emails and mastodon accounts, he accidentally got involved in founding a whole internet exchange, and he’s forever automating and building backends for things in the house that I really wish weren’t internet-enabled. (Look, I’m a security professional, I’m allergic to too much internet.)





One day his friend Kenneth decided it would be fun to run a software mirror for their internet exchange, and he roped John into it, and then into this harebrained idea of maybe running a lot of mirrors on cheap hardware. John had previously run kernel.org and the associated Linux mirrors there, and he had done so on big beefy servers with big beefy bandwidth, so he was skeptical that this would work. Still, not only was it cheap to try and see, but thanks to some donations they didn’t even have to lay out much of their own money to get it going. And long story short: it turns out it works incredibly well.





The deal is that they build up these cheap “thin client” boxes with a hard drive in them that have a copy of the data and are managed remotely by John and Kenneth. Then they offer them up for free to data centres that are willing to provide power and internet. It’s kind of a fully managed appliance, so the data centre gets blazing fast downloads of open source software for their customers and anyone else “nearby,” and Kenneth and John get a dot on their map and the knowledge that they’re helping distribute open source software. (Also they get to run globally load bearing infrastructure for funsies. Which it really is for them.)





Here’s my favourite picture: since one of our contributors is based in the UK, we turned on the UK-based mirrors first, and one of them is a data center in a box in a field:





Image Description: A dark green utility box sitting in a beautiful field with yellow summer grass, green bushes, and green trees along the edges. There is a wedge of blue sky with clouds visible. One of the software mirrors is inside the green box.




What’s been amazing is that this little network of devices is now a major powerhouse of Linux mirroring. They estimate that they’re providing 90% of the bandwidth used for VLC, so if you’ve downloaded that or anything else they serve, there’s a good chance you’ve already used one of these mirrors and not known it. Check out https://mirror.fcix.net/ if you want to see the list of projects. Kenneth is giving a talk at SeaGL in November if you want to hear more about the micro mirror story.





Serving the right data: files are better than APIs





The key to using these tiny servers is basically “Linux people optimized sending files in order to make mirrors work.” And they did that quite a while ago, so it’s really stable and fast now. You might think “oh, couldn’t you use BitTorrent?” but that adds a lot of overhead. (That paper is older, but the numbers haven’t made it look more appealing in the time since then.)





If we want to go with what works, then, we can’t mirror the NVD API — that would require processing, and these mirrors are not that smart. But it turns out… people didn’t really love the NVD API. It definitely filled a need for some folk, but when NVD tried to turn off the old file-based data, so many people protested that the original deadline for removing the files got pushed out and pushed out. So we can probably guess that many users would like the files as much as or better than the API, assuming they could get them faster and without rate limits.





So here’s what it looks like:






  1. We are running our own API crawler




  2. Generating JSON files compatible with the original ones




  3. Signing those files with possibly the sketchiest gpg key on the planet




  4. Mirroring these files to a worldwide CDN we created




  5. Literally solving the entire API / DDOS problem for… free?





Since cve-bin-tool has to speak the API already, we can have cve-bin-tool output valid JSON files when needed. That said, since NVD is still providing the JSON files at the time of this writing, we can (and do) get their files directly.





I should note that the technical implementation and testing in a live environment took a few months once we decided to do it. Much faster than waiting for funding!





Why should you trust us?





First: we are not affiliated with NIST, and they were not involved in any of this. I did email them so they’d know who was behind it in case it came up, and I got a nice reply saying, effectively, that they don’t officially endorse anything, which is fine. I want to joke that I’m the pirate radio of vuln data, but recall that the data is publicly licensed, so there’s no piracy involved. Just fast and efficient transmission of perfectly allowed data.





So why should you trust some internet randos to get you vulnerability data? After all, the software security industry tries to tell you to stop downloading files served by random people on the internet! But these are the same servers that you’re probably using to get security updates, so… you probably already do trust them?





For a lot of the software on these mirrors, it’s a trust-but-verify solution: the packages are signed and package managers validate those signatures, so even if one of the data centres wanted to serve up malicious code, it wouldn’t get auto-installed unless they also compromised some build and signing servers. So you’re trusting not just the mirror, but the whole process that makes sure the mirror serves up the right data.





If you’re going to build some similar verification into your tool that uses NVD data, you can verify our (sketchy) gpg signatures so you know it came from us, but you can also validate the data against NVD itself. For the JSON files, they provide some metadata you can use. If we’re generating our own JSON (as we expect to do when they turn off theirs) then it might get a bit more complicated, but you can probably figure something out. For example, if validating all the data is impractical, you could have something that uses the API to double-check only the CVEs you care about. You can also always use us as your “seed” source and then update against NVD directly, overwriting as needed.
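Here’s a rough sketch of what “validate against NVD itself” can look like while the legacy feeds still exist. It assumes the yearly nvdcve-1.1-YEAR.json.gz naming and the .meta files NVD publishes alongside them (whose sha256 field covers the uncompressed JSON); the mirror path below is a placeholder, so follow the instructions on https://cveb.in/ for the real layout:

```python
# Sketch: cross-check a mirrored legacy feed against NVD's own .meta checksum.
# The mirror directory layout here is a placeholder -- see https://cveb.in/ .
import gzip
import hashlib
import requests

def verify_feed(mirror_base, nvd_meta_base, year):
    """Download a yearly feed from the mirror and compare it to NVD's .meta."""
    name = f"nvdcve-1.1-{year}"
    gz = requests.get(f"{mirror_base}/{name}.json.gz", timeout=120)
    gz.raise_for_status()
    raw = gzip.decompress(gz.content)
    digest = hashlib.sha256(raw).hexdigest().upper()

    meta = requests.get(f"{nvd_meta_base}/{name}.meta", timeout=60)
    meta.raise_for_status()
    fields = dict(
        line.split(":", 1) for line in meta.text.splitlines() if ":" in line
    )
    if digest != fields["sha256"].strip().upper():
        raise ValueError(f"{name}: mirror copy does not match NVD's metadata")
    return raw  # verified JSON bytes, ready to json.loads()
```

If you also want to know the file really came from us rather than just matching NVD, that’s what the gpg signatures are for.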





(Incidentally, don’t bother trying to run a JSON schema check on the data as part of your checks unless you like noise. We did this in cve-bin-tool and had to turn it into a warning instead of halting, because NVD themselves produce invalid JSON files frequently enough that it was a problem. Turns out keeping a giant database full of user-submitted data valid is hard.)
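If you decide you want a schema check anyway, the warn-don’t-halt approach is straightforward. Here’s a sketch using the jsonschema package, where the schema file is whatever NVD publishes for the feed format:

```python
# Sketch: validate feed data against a schema but only warn on failures,
# since upstream files occasionally don't validate cleanly.
import json
import warnings
from jsonschema import validators  # pip install jsonschema

def check_feed(feed_path, schema_path):
    """Load a feed, warn about schema violations, and return the data anyway."""
    with open(schema_path) as f:
        schema = json.load(f)
    with open(feed_path) as f:
        data = json.load(f)
    validator_cls = validators.validator_for(schema)  # pick class from $schema
    problems = list(validator_cls(schema).iter_errors(data))
    for err in problems[:10]:  # cap the noise; these files can have many issues
        warnings.warn(f"{feed_path}: {err.message}")
    return data  # keep going either way
```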





Using the mirror





The instructions are here: https://cveb.in/





Basically, go nuts. Those little thin clients can handle full Fedora releases and don’t even max out on release day any more. Please use them! They should be fast, they are probably significantly less overloaded than the main NVD servers, and there are no rate limits or API keys needed. Plus, you’ll make pretty marks on John’s graphs.





You can also use the mirror data as part of cve-bin-tool so you don’t have to build your own scanning service!









Conclusion





I noticed a problem where software vulnerability data about CVEs was getting harder and harder to access, and roped the fine folk of the FCIX Micro Mirror project into hosting a copy of this publicly available data on https://cveb.in/, which they are doing for free thanks to donations of time, money, and server rack space from a variety of folk. These mirrors are fast, available worldwide, not rate limited, and we would love it if you used them.





Contacting us





The comments for this post will turn off after a few weeks because I don’t feel like dealing with spam, so feel free to hit me or John up with questions on the fediverse anytime! We’d also love to hear how you use https://cveb.in/









Future work





I’m not actively working on mirroring anything else at the moment, but I *do* think it would be super cool if we could get the micro mirror system to help provide files for pypi / pip. So if you’ve got a lead there and global distribution of python packages sounds like a good idea, let us know! And if you’ve got any other way we could make the world a better place for free, that’s cool too.

This is crossposted from Curiousity.ca, my personal maker blog. If you want to link to this post, please use the original link since the formatting there is usually better.


This is part of my series on “best practices in practice” where I talk about best practices and related tools I use as an open source software developer and project maintainer. These can be specific tools, checklists, workflows, whatever. Some of these have been great, some of them have been not so great, but I’ve learned a lot. I wanted to talk a bit about the usability and assumptions made in various tools and procedures, especially relative to the wider conversations we need to have about open source maintainer burnout, mentoring new contributors, and improving the security and quality of software.





If you’re running Linux, usually there’s a super easy way to check for updates and apply them. For example, on Fedora Linux `sudo dnf update` will do the magic for you. But if you’re producing software with dependencies outside of a nice distro-managed system, figuring out what the latest version is or whether the version you’re using is still supported can sometimes be a real chore, especially if you’re maintaining software that is written in multiple programming languages. And as the software industry is trying to be more careful about shipping known vulnerable or unsupported packages, there’s a lot of people trying to find or make tools to help manage and monitor dependencies.





I see a lot of people trying to answer “what’s the latest” and “which versions are still getting support” questions themselves with web scrapers or things that read announcement mailing list posts, and since this came up last week on the Mailman irc channel, I figured I’d write a blog post about it. I realize lots of people get a kick out of writing scrapers as a bit of a programming exercise and it’s a great task for beginners. But I do want to make sure you know you don’t *have* to roll your own or buy a vendor’s solution to answer these questions!





What is the latest released version?





The website (and associated API) for this is https://release-monitoring.org/





At the time that I’m writing this, the website claims it’s monitoring 313030 packages, so there’s a good chance that someone has already set up monitoring for most things you need so you don’t need to spend time writing your own scraper. It monitors different things depending on the project.





For example, the Python release tracking uses the tags on github to find the available releases: https://release-monitoring.org/project/13254/ . But the monitoring for curl uses the download site to find new releases: https://release-monitoring.org/project/381/





It’s backed by software called Anitya, in case you want to set up something just for your own monitoring. But for the project where I use it, it turned out to be just as easy to use the API.
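To give you an idea of how little code it takes, here’s a sketch against what I believe is the current v2 endpoint; double-check the API docs on the site before relying on the exact field names:

```python
# Sketch: ask release-monitoring.org (Anitya) for the latest known version
# of a project by name. Endpoint and field names per my reading of the v2 docs.
import requests

def latest_version(project_name):
    resp = requests.get(
        "https://release-monitoring.org/api/v2/projects/",
        params={"name": project_name},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return items[0]["version"] if items else None

print(latest_version("curl"))  # prints whatever version Anitya saw most recently
```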





What are the supported versions?





My favourite tool for looking up “end of life” dates is https://endoflife.date/ (so easy to remember!). It also has an API (note that you do need to enable javascript or the page will appear blank). It only tracks 343 products but does take requests for new things to track.





I personally use this regularly for the python end of life dates, mostly for monitoring when to disable support for older versions of Python.
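For example, here’s a quick sketch of that monitoring; the field names are from my reading of their API docs, so verify them before you build anything important on top:

```python
# Sketch: pull Python's release cycles from endoflife.date and flag which
# ones are past end-of-life. Field names per their published API docs.
from datetime import date
import requests

cycles = requests.get("https://endoflife.date/api/python.json", timeout=30).json()
today = date.today().isoformat()
for cycle in cycles:
    eol = cycle.get("eol")  # usually an ISO date string; a boolean for some products
    dead = eol is True or (isinstance(eol, str) and eol < today)
    status = "EOL" if dead else "supported"
    print(f"Python {cycle['cycle']}: {status} (eol: {eol})")
```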





I also really like their Recommendations for publishing End-of-life dates and support timelines as a starting checklist for projects who will be providing longer term support. I will admit that my own open source project doesn’t publish this stuff and maybe I could do better there myself!





Conclusion





If you’re trying to do better at monitoring software, especially for security reasons, I hope those are helpful links to have!

This is crossposted from Curiousity.ca, my personal maker blog. If you want to link to this post, please use the original link since the formatting there is usually better.


This is part of my series on “best practices in practice” where I talk about best practices and related tools I use as an open source software developer and project maintainer. These can be specific tools, checklists, workflows, whatever. Some of these have been great, some of them have been not so great, but I’ve learned a lot. I wanted to talk a bit about the usability and assumptions made in various tools and procedures, especially relative to the wider conversations we need to have about open source maintainer burnout, mentoring new contributors, and improving the security and quality of software.





I was just out at Google Summer of Code Mentor Summit, which is a gathering of open source mentors associated with Google’s program. Everyone there regularly works with new contributors who have varying levels of ability and experience, and we want to maintain codebases that have good quality, so one of the sessions I attended was about tools and practices for code quality. Pre-commit is one of the tools that came up in that session that I use regularly, so I’d like to talk about it today. This is a tool I wouldn’t have thought to look for on my own, but someone else recommended it to me and did the initial config for my project, so I’m happy to pay that forward by recommending it to others.





Pre-commit helps you run checks before your code gets checked in to git. Your project provides a config file of whatever tools it recommends you use. Once you’ve got pre-commit installed, you can tell it to use that file, and then those checks will run when you type `git commit`, with it halting if you don’t pass a check so you can fix it before you “save” the code. By default it only runs on files you changed, and it can be tuned by the project maintainers to skip files that aren’t compliant yet, so you don’t generally get stuck fixing other people’s technical debt unless that’s something that the maintainers chose to do.
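If you’ve never seen one, the config file (.pre-commit-config.yaml) is just a list of repos and hook ids. Here’s a minimal sketch (the rev values are placeholders, so pin whatever versions your project actually wants):

```yaml
# Minimal illustrative .pre-commit-config.yaml -- rev values are placeholders.
repos:
  - repo: https://github.com/psf/black
    rev: 24.8.0
    hooks:
      - id: black
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
```

Contributors then run `pre-commit install` once per clone, and after that the listed hooks fire automatically on every `git commit`.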





Under the hood there’s some magic happening to make sure it can install, set up, and configure the tools. It does tell you what’s happening on the command line, but it’s worlds better than having to install them all yourself, and it puts it into a separate environment so you don’t have to worry about needing slightly different versions for different projects. Honestly, the only time I’ve had trouble with this tool was when I was using it in a weird environment behind a proxy and some combination of things meant that pre-commit was unable to set up tools for me. I think that’s more of a failure of the environment than of the tool, and it’s been shockingly easy to set up and use on every other development machine where I’ve used it. One command to install pre-commit, then one command to set it up for each project where I use it.





I’m sure there are some programmers who are incredibly disciplined and manage to run all required checks themselves manually, but I am not the sort of person who memorizes huge arrays of commands and flags and remembers to run them Every Single Time. I am the sort of person who writes scripts to automate stuff because I will forget. Before pre-commit I would have had a shell script to do the thing, but now I don’t have to write those for projects that already have a config file ready for me. Thus, pre-commit speaks to the heart of how I work as a developer. I got into computers because I could make them do the boring stuff.





Image Description: A photo of the package locker in a US shared mailbox. A label around the keyhole reads “open” with arrows and then says “key will remain in lock after opening door” — it’s a great example of design that doesn’t rely on users remembering to do the right thing (in this case, giving back the key for future use)




Pre-commit also speaks to the heart of my computer security philosophy: any security that relies on humans getting things 100% right 100% of the time is doomed to fail eventually. And although a lot of this blog is about knitting and fountain pens and my hobby work, I want to remind you that I’m not just some random person on the internet when it comes to talking about computer security: I have a PhD in web security policy and I work professionally as an open source security researcher. Helping people write and maintain better code is a large portion of my day job. A lot of the most effective work in security involves making it easy and “default” for people to make the most secure choices. (See the picture above for a more physical example of the design philosophy that ensures users do the right thing.)





Using pre-commit takes a bunch of failure points out of our code quality and security process and makes it easier for developers to do the right thing. For my current work open source project, we recommend people install it and use it on their local systems, then we run it again in our continuous integration system and require the checks to pass there before the code can be merged into the main branch.





As a code contributor:






  • I like that pre-commit streamlines the whole process of setting up tools. I just type pre-commit install in the directory of code I intend to modify and it does the work.




  • I can read the .pre-commit-config.yaml file to find out a list of recommended tools and configurations for a project all in one place. Good if you’re suspicious of installing and using random things without looking them up, but also great for learning about projects or about new tools that might help you with code quality in other projects.




  • It only runs on files I changed, so the fixes it recommends are usually relevant to me and not someone else’s technical debt haunting me.




  • It never forgets to run a check (unless I explicitly tell it to).




  • It helps me fix any issues it finds before they go into git, so I don’t feel obliged to fuss around with my git history to hide my mistakes. Git history is extremely obnoxious to fuss with and I prefer to do it as infrequently as humanly possible.




  • It also subtly makes me feel more professional to know that all the basic checks are handled before I even make a pull request. I’ve been involved in open source so long that I mostly don’t care about my coding mistakes being public knowledge, but I know from mentoring others that a lot of people find the idea of making a mistake in public very hard, and they want to be better than the average contributor from the get-go. This is definitely a way to make your contributions look better than average!




  • It gives me nearly immediate, local feedback if my code is going to need fixes before it can be merged. I like that I get feedback usually before my brain has moved on to the next problem, so it fits into my personal mental flow before I even go to look at another window.




  • It can get you feedback considerably faster than waiting for checks to run in a continuous integration system. If you’re lucky, a system like GitHub Actions can get you feedback within a few minutes on quick linter-style checks, but if the system is backed up, or if you’re a new contributor to a project and someone has to approve things before they run (to make sure you’re not just running a cryptominer or other malicious code in their test system!), it can take hours or days to get feedback. Being able to fix things before the tests run can save a lot of time!





As a project maintainer:






  • Letting me configure the linters and pre-checks I want in one place instead of multiple config files is pretty fantastic and keeps the root directory of my project a lot less full of crap.




  • It virtually eliminates problems where someone uses a tool subtly differently than I do. If you’re not an open source project maintainer who works with random people on the internet you may not realize how much of a hassle it is helping people configure multiple development tools, but let me tell you, it’s a whole lot easier to just tell them to use pre-commit.

    • Endlessly helping people get started and answering the same questions over and over can be surprisingly draining! It’s one of the things we really watch for in Google Summer of Code when trying to make sure our mentors don’t burn out. Anything I can do that makes life easier for contributors and mentors and avoids repetitive conversations has an outsized value in my toolkit.






  • Being able to run exactly the same stuff in our continuous integration/test system means even if my contributors know nothing about it, I still get the benefits of those checks happening before I do my code review.




  • It saves me a lot of time back-and-forth with contributors asking for fixes so it lets me get their code merged faster. A nicer experience for all of us!




  • I can usually configure which files need to be skipped, so it can help us upgrade our code quality slowly. Or I can use it as a nudge to encourage people changing a file to also fix minor issues if I so desire.





What gets run with pre-commit will obviously depend on the project, but I think it’s probably helpful to give you an idea of what I run. I talked about using black, the Python code formatter, in a previous best practices post. For my work open source project, it’s only one of several code quality linters we use. We also use pyupgrade to help us be forward-compatible with Python syntaxes, bandit to help us find Python security issues, gitlint to help us provide consistency in commit messages (we use the conventional commits format rules), and mypy to help us slowly add static typing to our code base.





Usually before installing a new pre-commit hook, I make sure all files will pass the checks (and disable scanning of files that won’t). Some tools are pretty good at a slow upgrade if you so desire. One such tool for us has been interrogate, which prompts people to add docstrings — I have it set up with a threshold so the files will pass. When pre-commit runs, the output includes a report with red segments for any functions that are missing docstrings, even though the check passes, so you don’t have to fix them right away. Sometimes that means someone working in that file will go ahead and fix those interrogate warnings while they’re working on their bugs, and that’s incredibly nice.





I’ll probably talk about some of these tools more later on in this best practices in practice series, but that should give you some hints of things you might run in pre-commit if you don’t already have your own list of code quality tools!





Summary





Pre-commit is a useful tool to help maintain code quality (and potentially security!) and it can be used to slowly improve over time.





I only found out about pre-commit because someone else told me and I’m happy to spread the word. I don’t think tools like pre-commit attract evangelists the way some other code-adjacent tools do, and it’s certainly not the sort of thing I learned about when I learned to code, when I got involved in open source initially, or even when I was in university (which was long after I learned to code and got into open source). I’m sure it’s not the only tool in this category, but it’s the one I use and I like it enough that I haven’t felt a need to shop around for alternatives. I don’t know if it’s better for Python than for other languages, but I love it enough that I could see myself contributing to make it work in other environments as needed, or finding similar tools now that I know this is an option.





As a project maintainer, I feel like it helps improve the experience both for new contributors who can use it to help guide them to submit code I’ll be able to merge, and for experienced contributors and mentors who then don’t have to spend as much time helping people get started and dealing with minor code nitpicks during code reviews. As an open source security researcher, I feel like it’s a pretty powerful tool to help improve code quality and security with easy feedback to developers before we even get to the manual code review stage. As a developer, I like that it helps me follow any project’s best practices and gives me feedback so I can fix things before another human even sees my code.





I hope other people will have similar good experiences with pre-commit!





This is crossposted from Curiousity.ca, my personal maker blog. If you want to link to this post, please use the original link since the formatting there is usually better.


I’m starting a little mini-series about some of the “best practices” I’ve tried out in my real-life open source software development. These can be specific tools, checklists, workflows, whatever. Some of these have been great, some of them have been not so great, but I’ve learned a lot. I wanted to talk a bit about the usability and assumptions made in various tools and procedures, especially relative to the wider conversations we need to have about open source maintainer burnout, mentoring new contributors, and improving the security and quality of software.





So let’s start with a tool that I love: Black.





Black’s tagline is “the uncompromising Python code formatter” and it pretty much is what it says on the tin: it can be used to automatically format Python code, and it’s reasonably opinionated about how that’s done, with very few options to change. It starts with PEP 8 compliance (that’s the Python style guide, for those of you who don’t need to memorize such things) and takes it further. I’m not going to talk about the design decisions they made, but the Black style guide is actually an interesting read if you’re into this kind of thing.
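To make “opinionated” concrete, here’s roughly the kind of rewrite Black does on a small made-up function, where quote style and spacing stop being decisions anyone gets to make:

```python
# Before: perfectly legal Python, just inconsistently formatted.
def make_user( name,email,admin = False ):
    return {'name':name,'email':email,'admin':admin}

# After running black: normalized spacing and double quotes, no manual decisions.
def make_user(name, email, admin=False):
    return {"name": name, "email": email, "admin": admin}
```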





I’m probably a bit more excited about style guides than the average person because I spent several years reading and marking student code, including being a teaching assistant for a course on Perl, a language that is famously hard to read. (Though I’ve got to tell you, the first year undergraduates’ Java programs were absolutely worse to read than Perl.) And then, in case mounds of beginner code weren’t enough of a challenge, I was also involved in a fairly well-known open source project (GNU Mailman) with a decade of code to its name even when I joined, so I was learning a lot about the experience of integrating code from many contributors into a single code base. Both of these are… kind of exhausting? I was young enough to not be completely set in my ways, but especially with the beginner Java code, it became really clear that debugging was harder when the formatting was adding a layer of obfuscation to the code. I’d have loved to have an autoformatter for Java because so many students could find their bugs easier once I showed them how to fix their indents or braces.





And then I spent years as an open source project maintainer rather than just a contributor, so it was my job to enforce style as part of code reviews. And… I kind of hated that part of it? It’s frustrating to have the same conversation with people over and over about style and be constantly leaving the same code review comments, and then on top of that sometimes people don’t *agree* with the style and want to argue about it, or people can’t be bothered to come back and fix it themselves so I either have to leave a potentially good bug fix on the floor or I have to fix it myself. Formatting code elegantly can be fun once in a while, but doing it over and over and over and over quickly got old for me.





So when I first heard about Black, I knew it was a thing I wanted for my projects.





Now when someone submits a thing to my code base, Black runs alongside the other tests, and they get feedback right away if their code doesn’t meet our coding standards. It takes hardly any time to run, so sometimes people get feedback very fast. Many new contributors even notice the failing required test and go do some reading and fix it before I even see it, and for those that don’t fix issues before I get there, I get a much easier conversation that amounts to “run black on your files and update the pull request.” I don’t have to explain what they got wrong and why it matters — they don’t even need to understand what happens when the auto-formatter runs. It just cleans things up and we move on with life.





I feel like the workflow might actually be better if Black were run in our continuous integration system and automatically updated the submitted code, but there are some challenges there around security and permissions that we haven’t gotten around to solving. And honestly, it’s kind of nice to have an easy low-stress “train the new contributors to use the tools we use” or “share a link to the contributors doc” opening conversation, so I haven’t been as motivated as I might be to fix things. I could probably have a bot leave those comments and maybe one of these days we’ll do that, but I’m going to have to look at the code for code review anyhow so I usually just add it in to the code review comments.





The other thing that Black itself calls out in their docs is that by conforming to a standard auto-format, we really reduce the differences between existing code and new code. It’s pretty obvious when the first attempt has a pile of random extra lines and is failing the Black check. We get a number of contributors using different integrated development environments (IDEs) that are pretty opinionated themselves, and it’s been freeing not to have to deal with whitespace nonsense in pull requests or have people try to tell me about the glory of their IDE of choice when I ask them to fix it. Some Python IDEs actually support Black, so sometimes I can just tell them to flip a switch or whatever and then they never have to think about it again either. Win for us all!





So here’s the highlights about why I use Black:





As a contributor:






  1. Black lets me not think about style; it’s easy to fix before I put together a pull request or patch.




  2. It saves me from the often confusing messages you get from other style checkers.




  3. Because I got into the habit of running it before I even run my code or tests, it serves as a quick mistake checker.




  4. Some of the style choices, like forcing trailing commas in lists, make editing existing code easier and I suspect increase code quality overall because certain types of bug are more obvious.





As an open source maintainer:






  1. Black lets me not think about style.




  2. It makes basic code quality conversations easier. I used to have a *lot* of conversations about style and people get really passionate about it, but it wasted a lot of time when the end result was usually going to be “conform to our style if you want to contribute to this project.”




  3. Fixing bad style is fast, either for the contributor or for me as needed.




  4. It makes code review easier because there aren’t obfuscating style issues.




  5. It allows for very quick feedback for users even if all our maintainers are busy. Since I regularly work with people in other time zones, this can potentially save days of back and forth before code can be used.




  6. It provides a gateway for users to learn about code quality tools. I work with a lot of new contributors through Google Summer of Code and Hacktoberfest, so they may have no existing framework for professional development. But also even a lot of experienced devs haven’t used tools like Black before!




  7. It provides a starting point for mentoring users about pre-commit checks, continuous integration tests, and how to run things locally. We’ve got other starting points but Black is fast and easy and it helps reduce resistance to the harder ones.




  8. It reduces “bikeshedding” about style. Bikeshedding can be a real contributor to burnout of both maintainers and contributors, and this reduces one place where I’ve seen it occur regularly.




  9. It decreases the cognitive overhead of reading and maintaining a full code base which includes a bunch of code from different contributors or even from the same contributor years later. If you’ve spent any time with code that’s been around for decades, you know what I’m talking about.




  10. In short: it helps me reduce maintainer burnout for me and my co-maintainers.





So yeah, that’s Black. It improves my experience as an open source maintainer and as a mentor for new contributors. I love it, and maybe you would too? I highly recommend trying it out on your own code and new projects. (and it’s good for existing projects, even big established ones, but choosing to apply it to an existing code base gets into bikeshedding territory so proceed with caution!)





It’s only for Python, but if you have similar auto-formatters for other languages that you love, let me know! I’d love to have some to recommend to my colleagues at work who focus on other languages.
