Before this, I had written a lot of scrapers, but never a crawler, and I'd never tried scrapy. So I wanted to build a crawler with scrapy, use it to generate sentiment analysis of news articles, and see what news agencies were the most happy or sad or neutral.
So far I've written 41 scrapers, some with selenium, some with just scrapy's request system, and added each one to a crontab to run at varied intervals. I used sqlalchemy to store the articles and agencies, and then built a flask site to display the information, with materialize css as a basic css framework to make it look semi-spiffy. Then I added in a word cloud for each news agency for the day, and the whole thing runs continuously. The front page shows which spider is running at any given moment, so sometimes you might see it say "CNN" or "ALJAZEERA" is running.
The initial deployment was on a raspberry pi 4, but it became just too much for the little guy and I moved up to a desktop running headless Arch I had lying around. Then, when I started to get concerned with the future of the project, I split it up: the database and website are hosted on Digital Ocean, and the scraper runs on my local Arch machine. I use a big crontab to manage all the scrapers; most run every hour, staggered, while a few longer-running ones (i.e., those using Selenium) run every 2 hours.
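A fragment of that crontab might look something like this (the paths and spider names here are illustrative, not the real ones):

```crontab
# hourly spiders, staggered by minute so they don't pile up
5  * * * * cd /home/me/newscrawler && scrapy crawl cnn
20 * * * * cd /home/me/newscrawler && scrapy crawl aljazeera
# heavier Selenium-based spiders run every 2 hours
0 */2 * * * cd /home/me/newscrawler && scrapy crawl some_slow_site
```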
I did some back-of-the-envelope calculations to figure out if a cheap basic droplet would have enough storage space for all my articles, and at the rate the database grows it seems I'd end up taking about a year per gigabyte. At 25 GB for a basic-level droplet, that's about 25 years before I'd have to upgrade storage.
It's up to over a hundred thousand articles now, and one of the things I found interesting was how many old articles have been added. It only queries the front page of any news agency, but sometimes they feature articles from their archives. Especially the New Yorker.
One of the most interesting problems in this project was figuring out how to represent sentiment as a color. I wanted negative/positive sentiments to be displayed as a gradient between red and green. It turns out you can treat any two hex colors as points in RGB space and linearly interpolate each channel to find any point along the gradient. The code's here, if you want to see it.
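The idea can be sketched in a few lines. This isn't the site's actual code (the specific red/green hex values here are placeholders), just the interpolation technique it describes:

```python
def lerp_color(start_hex, end_hex, t):
    """Linearly interpolate between two '#rrggbb' colors; t in [0, 1]."""
    start = [int(start_hex[i:i + 2], 16) for i in (1, 3, 5)]
    end = [int(end_hex[i:i + 2], 16) for i in (1, 3, 5)]
    # interpolate each RGB channel independently
    mixed = [round(s + (e - s) * t) for s, e in zip(start, end)]
    return "#" + "".join(f"{c:02x}" for c in mixed)

def sentiment_color(score, neg="#e53935", pos="#43a047"):
    """Map a sentiment score in [-1, 1] onto a red-to-green gradient."""
    return lerp_color(neg, pos, (score + 1) / 2)
```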
The ICC is a web application designed to allow for collaboratively building an exhaustive and definitive repository of annotated literature. I designed, developed, deployed, and continue to maintain the project solo, occasionally managing to rope in some programmer friends for help with various features. It consists of ≈14k lines of code and ≈100k lines of code churn. The backend uses Flask/SQLAlchemy. The frontend uses Jinja2, Sass, and VanillaJS. I also had to write several ETL data pipelines for processing Project Gutenberg texts. It is deployed via Heroku. I learned the entire web application life cycle on this project, and it continues to teach me. It is currently maintained at https://anno.wiki. I am also currently working on a second iteration of the site.
The project consisted of several phases:
- Researching the best base software to build a community upon. We considered several wordpress plugins as well as Codidact, even going so far as to build a feature matrix comparing 20 solutions.
- Launch consisted of modifying most of the templates and building a bit of infrastructure on a VPS.
- Finally, continued development included adding a number of features: more granular permissions, accessibility improvements, and abstracting much of the site so multiple instances could be launched from the same code branch.
For one of the first projects I did for HtmMbs, JP, the owner, wanted a way to better manage inventory counts.
Because the company uses Odoo for everything, I had to write a lot of logic in celery to keep everything synced. Adding a module to Odoo wasn't preferable because (a) this was a very heavy-duty application, around 10 kloc, and (b) every new module in Odoo just makes the next migration that much harder. Better to use separate services.
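The core of that sync logic is just diffing local counts against what Odoo reports and pushing back the differences. A minimal sketch of the kind of pure helper a Celery task might call (the function names and data shapes here are my illustration, not the production code):

```python
def counts_to_sync(local_counts, odoo_counts):
    """Return {product_id: qty} for products whose local count
    differs from what Odoo currently has (missing counts as different)."""
    return {
        pid: qty
        for pid, qty in local_counts.items()
        if odoo_counts.get(pid) != qty
    }

# A periodic Celery task (scheduled with celery beat) would fetch the
# remote counts over Odoo's XML-RPC API, call counts_to_sync, and write
# each difference back, keeping the two systems in agreement.
```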
Among the features I added:
- An emergency queue and quasi-messaging system for communication between managers and admins to see if locations were being properly allocated and counters were consistent.
- A location allocation system.
- A picking allocation system in React that had a responsive map for allocating special picking locations.
- EOD reports for admins containing detailed statistics about the day's business.
So I got sick of my old website!
The last one (featured in this list as "standingwater.io & blog.standingwater.io") was using the old Gatsby v2 and a starter template from HTML5Up. I decided I wanted a fresh start.
I had a conversation with my boss about style design in which he said his company aims to keep things professional, sleek, and modern, and that that's the image his company strives for. In context, he was pooh-poohing bright colors. My first thought was, "then why does your website look like it was designed in 2005?"
So I made this site in Gatsby, from scratch, to be playful and colorful, and I love it.
I took a lot of inspiration from pages on onepagelove.com. I'm still tweaking it: I want more animation, and I'd like some URL mangling on the projects page so people can deep-link to different filters. But mostly, I like how it turned out, a lot.
There's something about the style, the yellows and round edges and drop shadows, that really reminds me of some educational game I played when I was a kid that I just can't recall now. Perhaps it's this: the Speak & Spell, but I remember an interface that looked like this.
Also, I finally got to include that image of palmettos I've always liked into the home page.
During the 2020 election my friend and I were rapt with the polling changes.
I developed a couple of technologies to monitor changes:
- A site, polling.netlify.app, that showed all of the individual polls to track changes within a given poll. I had a Jupyter Notebook that would pull 538's polling data CSV and parse it to create a ton of graphs, which would then be pushed to the Netlify/Gatsby app.
- A script that ran in AWS Lambda that would, every ten minutes, check 538's topline polling projection and then text me and a friend every time there was a change using Twilio.
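The heart of that Lambda script is just comparing the last seen topline with the current one and composing a text when something moved. A rough sketch under my own naming (the dict shape and message format are illustrative; the real script read 538's data and stored the previous state between runs):

```python
def topline_changed(previous, current):
    """Return a human-readable diff if the topline projection moved,
    else None. Both args are dicts like {"candidate": percentage}."""
    moved = {k: (previous.get(k), v) for k, v in current.items()
             if previous.get(k) != v}
    if not moved:
        return None
    return "; ".join(f"{k}: {old} -> {new}" for k, (old, new) in moved.items())

# In the Lambda handler, a non-None result would become the body of a
# twilio.rest.Client(...).messages.create(to=..., from_=..., body=...) call.
```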
This stuff was a hell of a lot of fun. I expect once 538 launches the '22 model, I'll be setting it all up again.
This was a pretty simple project, but it turned out so beautiful that I love it. It was also one of my first paid projects and helped me win the contract for working on an interface for a laser system.
This project was for a client who sells various vitamins on Amazon. I developed, and redeveloped over several successive weeks, a rather large web scraping/api scraping system that gleans information about Amazon Marketplace Offers on products the client is selling from multiple data sources.
Because of the project's increasing complexity, it ended up becoming rather large, with a decent number of features:
- multithreading capabilities
- a token tracking system to make sure to stay within the rate limit
- header spoofing and proxy rotation
- near real-time updates for various settings, including which ASINs to track and notification endpoints, using cloud file storage
- a PostgreSQL database for tracking data (using SQLAlchemy as the ORM, of course).
- database synchronization with an Elasticsearch instance for a Kibana dashboard
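The token tracking mentioned above is essentially a token bucket: requests spend tokens, tokens refill at the rate limit, and callers wait when the bucket runs dry. A minimal single-threaded sketch (the real system's parameters and locking are not shown; this is the technique, not the production code):

```python
import time

class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each request
    costs one token, and acquire() blocks until one is available."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def acquire(self):
        while True:
            now = time.monotonic()
            # credit tokens for the time elapsed since the last update
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # sleep just long enough for the next token to accrue
            time.sleep((1 - self.tokens) / self.rate)
```

For multithreaded use, as in this scraper, the bucket would need a lock around `acquire`.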
The software was deployed to an Arch Linux VPS and used a Systemd Unit to run. I've done automated data gathering (mostly api/web scraping) before, but never on this level. I had a lot of fun, building from the ground up. I'm especially glad for having learned to do multithreading at a decent level.
This project was inspired by a trend on r/languagelearning where people would post world maps with the countries colored by the languages they spoke. I decided to build an actual dedicated web app that would automatically color the countries themselves.
Doing this required a lot of data gathering and cleaning, which was itself an arduous process with a lot of false starts. The method I finally settled on was to use Google Sheets' IMPORTHTML function, which can import Wikipedia tables into a spreadsheet. To automate this I used Python and gspread. It was definitely a lesson in data gathering and cleaning.
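The automation boils down to programmatically building IMPORTHTML formulas and writing them into cells. A small sketch of the formula-building half (my own helper, not the project's actual code):

```python
def importhtml_formula(url, index=1, kind="table"):
    """Build the formula Google Sheets' IMPORTHTML expects:
    =IMPORTHTML(url, query, index), where query is "table" or "list"."""
    return f'=IMPORTHTML("{url}", "{kind}", {index})'
```

With gspread, the formula can then be written into a cell (using the "USER_ENTERED" value input option so Sheets evaluates it rather than storing literal text), and the resulting table read back with `worksheet.get_all_values()` for cleaning in Python.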
As part of my continuing obsession with coffee I discovered that I could get a lot cheaper specialty coffee, roasted closer to my personal taste, from the regions I prefer (East Africa), by just roasting myself. Sweet Maria's provides troves of information about how to home roast, and one of the suggestions they make for first timers is to use an inexpensive popcorn popper. They recommend the Nostalgia.
So I bought it, and was dissatisfied with the lack of control I had. It turns out there are a not insignificant number of home roasters who have tried hooking up a PID controller to their popcorn popper (or generally modding it). I found this decent write-up on how to use an Arduino to hook the heating element up to a 40 A relay, and to use a thermocouple with an RTU Modbus implementation to communicate with Artisan Scope, a professional-grade open source program written in Python for monitoring and controlling roast parameters (more meant for large machines, but still).
There are some problems with Lukas's write-up. Namely, Artisan's internal PID system is really hard to get working. I ended up building an internal PID using an Arduino library (you can see my code here), so Artisan just sends the PID values and the SV and monitors the temperature. I also found it unnecessary to use a MOSFET between the Arduino and the relay. And for debugging, I added an LCD screen (which is how I discovered that Artisan's PID was never working in the first place).
This project was insane fun. As a programmer, I often get a kick out of some piece of software working. But making something do stuff in meatspace is a rush I'd never experienced before. It was also the first time I'd done real soldering. It was a lot easier than I expected.
- Arduino Uno R3 from Elegoo
- MAX31855 Thermocouple Amplifier breakout board from Adafruit with a K-type thermocouple
- HD44780 LCD
- Inkbird 40DA SSR
- Artisan Scope
- Various Arduino libraries
My father hosts a golf tournament every year known as the DuVall Invitational. It's been running for over fifty years and is mostly played by men in the construction industry and their friends. But as with everything else, a bit of software improved the general quality of life of the tournament.
They use a master Excel spreadsheet to keep track of scoring, auction data, and handicaps that is carried over from year to year. A year or two ago I made some massive improvements to the software infrastructure they use.
First, I remade their spreadsheet from scratch, as it had become a crufty mess of tangled code, with IF(IF(IF()))-style formulas that stretched over a hundred characters long with excruciatingly difficult logic. I upgraded them to the latest version of Excel, which allowed me to program it using modern Dynamic Arrays and Named Ranges. This made drop-down selection a lot easier, so that the list of options in a given cell could be checked against the sheet itself to filter out used options.
I also created a standalone handicap spreadsheet that allows them to dynamically calculate handicaps by weighting scores from the previous 5 years without having to mess with moving columns and rewriting formulas.
I even made a PR upstream to Sergey, the maintainer of the SSG I used, when I found that its development server automatically binds to 0.0.0.0, exposing it to the network. Unfortunately, the PR still hasn't been merged. Luckily, the SSG is lightweight enough that I can maintain it myself without worrying about it being abandoned.
All the code for the site can be found at mas-4/duvall.
This project was inspired by a photo that made the rounds of Twitter in the wake of the protests in June 2020. The photo was of John Trumbull's famous painting of the signing of the Declaration of Independence with red dots over the faces of all the slave-owning founding fathers. Only 8 or 9 of the several dozen men in the painting could boast never having owned slaves, and John Adams was smack dab in the middle, his face as apparent as ever.
I've always been a bit of a John Adams fanboy so I decided to put my Python to work making a twitterbot that would tweet lines from the letters of John and Abigail Adams, a lovely repository of romantic correspondence.
The bot tweets out fairly regularized sentences in threads that correspond to each letter in chronological order, and then repeats. I used tweepy for the Twitter interfacing, AWS's Lambda to run the program, and AWS's DynamoDB to keep track of where the bot is. The actual letters are hosted on a static site of json files on Netlify. That simplified some of the deployment for various reasons I can't recall now.
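The "regularized sentences" step is simple text normalization and sentence splitting. A rough sketch of that piece (my own illustration; the real bot also filters and threads the results via tweepy, with its position stored in DynamoDB):

```python
import re

def regularize(letter_text):
    """Collapse whitespace and split a letter into tweet-sized sentences."""
    text = re.sub(r"\s+", " ", letter_text).strip()
    # split after sentence-ending punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # keep only sentences that fit in a tweet
    return [s for s in sentences if s and len(s) <= 280]
```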
Learning French from English is sometimes tricky, especially recognizing the difference between certain words, like au-dessous vs au-dessus (below vs above). So I decided to generate an Anki deck with audio files, pulled from forvo.com's API, of words distinguishable by a single phoneme (the ones I find myself struggling to distinguish). To read about my process, please see the README in the Github repository.
The other deck, the numbers deck, consists of 2500 audio files generated through IBM Cloud's text-to-speech generator. It is the first 2500 numbers in French with their corresponding words and numerical representation. It is sometimes hard to understand French numbers because the language uses a mixed decimal/vigesimal system (seventy is sixty-and-ten, eighty is four-twenties, and ninety is four-twenties-and-ten).
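To make the mixed system concrete, here's a toy illustration of just the tens (the real deck used the num2words library rather than anything hand-rolled like this):

```python
def french_tens(n):
    """Toy illustration of French's mixed decimal/vigesimal tens."""
    base = {10: "dix", 20: "vingt", 30: "trente", 40: "quarante",
            50: "cinquante", 60: "soixante"}
    if n in base:
        return base[n]
    if n == 70:
        return "soixante-dix"      # "sixty-ten"
    if n == 80:
        return "quatre-vingts"     # "four-twenties"
    if n == 90:
        return "quatre-vingt-dix"  # "four-twenties-ten"
    raise ValueError("multiples of ten from 10 to 90 only")
```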
These projects are simple, but I am proud of them. I seem to have helped a lot of people learning French with them.
The minimal pairs deck, in particular, required a lot of strategizing for dealing with the forvo.com api rate limits (500 requests per day) and the combinatorial explosion of comparing minimal pairs.
- Generated an Anki deck, using the forvo.com API, for recognizing phonemes difficult for an American English speaker to hear, built around "minimal pairs": words in French that differ by only one phoneme. Almost 600 downloads and 3 positive reviews.
- Also generated an Anki deck for training the ear to hear difficult French numbers (which use a mixed vigesimal, or base-20, system) using IBM Cloud's Text to Speech renderer and the num2words Python library. Over 500 downloads and 3 positive reviews.
- Used Requests for automated downloading of the audio files
- Used BeautifulSoup4 and Requests for scraping the list of minimal pairs in an IPython session
Note: This refers to the previous iteration of this site
I developed this portfolio site and the accompanying blog using GatsbyJS, which I've come to increasingly love for quick static site generation. The GraphQL is, while not the most intuitive, at least pretty useful. Especially the dynamic image processing.
The main portfolio site is built from a modified HTML5 Up template called Dimension, which was ported to Gatsby as gatsby-dimension-starter by codebushi. I have modularized it a bit, like this projects section, which uses markdown files and images for each project and an array for ordering and inclusion (there are some projects I don't display for NDA reasons, unfortunately).
HTML5 Up makes some beautiful templates, but their code is often way too tightly coupled and difficult to work with.
The blog I built from the default starter.
The style is inspired by my childhood home in Tampa, Florida. I initially wanted to use a background of palmettos, but it proved too visually difficult a background. I went with a swamp, instead, of which we have many. The site name is a reference to a joke name we gave to the 5 1/4 acre "estate" I grew up on. It puddled quite a lot. Almost looked like a lake.
I also used The Noun Project to purchase the heron icon used in both sites as representative of the area. I'd prefer an Ibis, since they're lousy in the area, but a good one was hard to find.
This site was built for my father's commercial painting company. It was my first project in GatsbyJS and was meant to replace glendalepainting.com. Unfortunately, my father is rather slow to update things and he's still using the original Joomla-based site.
The gallery is perhaps the most impressive aspect. A friend of mine helped build that section.
The main page, using react-spring for the parallax, was quite a nightmare. Parallax is not the easiest to design, but it's a dumpster fire when you have to do it for mobile devices using background images instead of sprites.
Never again. Just don't.
A friend of mine had never voted, nor registered to vote. So I made sure she did in 2020 by writing this little script to text her using Twilio, reminding her to vote on an increasingly frequent basis as the registration deadline drew near. She registered pretty quickly, actually, and it didn't run for that long. Which is lucky, because I ended up discovering a breaking bug in the month rollover.
I ended up having to resort to good old fashioned wheedling and emotional manipulation to get her to actually vote, because she ended up just blocking the number.
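The "increasingly frequent" cadence is just date arithmetic. A sketch of how it might be done (the thresholds here are my invention, not the script's; subtracting `date` objects directly sidesteps the month-rollover arithmetic that bit the original):

```python
from datetime import date

def reminders_per_day(today, deadline):
    """How many reminder texts to send today, ramping up near the deadline."""
    days_left = (deadline - today).days  # no manual month math
    if days_left <= 0:
        return 0                         # deadline passed
    if days_left <= 3:
        return 3                         # final push
    if days_left <= 14:
        return 1                         # daily in the last fortnight
    return 1 if today.weekday() == 0 else 0  # weekly (Mondays) before that
```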
This was a quick one-off project born of a joke idea. It's a Python script using the secrets module from the Standard Library to generate a password that includes an English word from the Electronic Frontier Foundation's Diceware list. It produces passwords that pair random characters with one recognizable word.
The idea was to circumvent the "gobbledy-gook" social-engineering vulnerability in security questions. Some people just use a randomly generated string as the answer, but the worry is that when an operator on a phone asks for it, they'd accept the response "it's just a bunch of gobbledy-gook." This script's idea is that if there's a single recognizable word in the security phrase, the operator would hopefully at least require that word.
A better solution would be to just use diceware, but where's the fun in that?
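The technique fits in a few lines. This sketch embeds a stand-in word list (the real EFF Diceware list ships as a text file of 7,776 words, and the actual script's format may differ):

```python
import secrets
import string

# Stand-in for the EFF Diceware list, which is normally read from a file.
WORDS = ["correct", "horse", "battery", "staple", "puddle", "heron"]

def security_answer(n_random=12):
    """Random characters plus one recognizable English word,
    using the cryptographically secure secrets module."""
    alphabet = string.ascii_letters + string.digits
    gibberish = "".join(secrets.choice(alphabet) for _ in range(n_random))
    return f"{gibberish}-{secrets.choice(WORDS)}"
```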
This script came about because of a family tradition of playing Christmas Trivia. The idea was to create a computerized method for achieving the closest balanced teams of arbitrary size and number.
It takes a JSON file with team members and their skill levels (on any numeric scale: 1-10, 1-100, whatever), and then uses a generator to find ever-better matchups over time. That is to say, it continuously looks for the next best team arrangement. Calculating all the possible team combinations is extremely computationally expensive, with huge diminishing returns, but after about five to ten minutes you get pretty good team arrangements, better than if you just did it by hand.
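The generator idea, reduced to two teams, looks something like this (the real script handles arbitrary team counts and sizes; this is just a sketch of yielding progressively better splits so the caller can stop whenever they're satisfied):

```python
from itertools import combinations

def improving_splits(players):
    """Yield two-team splits of players ({name: skill}) whose total-skill
    difference keeps improving; consume as long as you have patience."""
    names = sorted(players)
    total = sum(players.values())
    best = float("inf")
    for team_a in combinations(names, len(names) // 2):
        skill_a = sum(players[n] for n in team_a)
        diff = abs(total - 2 * skill_a)  # |team_a skill - team_b skill|
        if diff < best:
            best = diff
            team_b = tuple(n for n in names if n not in team_a)
            yield team_a, team_b, diff
```

A caller can run it for five minutes, keep the last yielded split, and stop.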