Mike Liddle

Projects, Tutorials, Experiences, Software, and more

All posts in one long list


Cross-Signed Certificates

Who do you trust?

One of the earliest internet protocols was http. Very quickly, engineers discovered a need to secure content that they sent/received. This lead to the development of the HTTPS, SSL, and TLS protocols. One of the main features to this protocol is the reliance on X.509 Certificates1. The idea behind certificates is that developers can go to a central authority, request authorization, and have a certificate that others can trust, if they trust that central authority.

This works just fine if every connecting agent has trusts that central authority, however, due to the complexities of X.509 Certificates, trust verification, and the diversity of revocation checking methods, certificate revocation is effectively broken for websites, with most major browsers performing a “best effort” check if they check at all.

Sometimes Certification Authorities are compromised though. In 2011, a reseller (RA) for Comodo (one of the largest CAs) was compromised, allowing several fraudulent certificates to be issued and trusted for google.com and other major domains. Later that year, DigiNotar was also compromised, issuing more fraudulent certs. The mitigation was to revoke the intermediate cert/root cert that was compromised, and remove the certificate from each computer’s trust store. Doing so, however, eliminated trust for all sites who could be trusted, but had certificates issued by Comodo’s reseller or DigiNotar. So those sites were effectively offline until their certificates were re-issued (which is not a definite time frame).

Cross-Signing

Cross-signing a certificate allows for a private/public key pair to be signed by multiple CA’s. By getting a cross-signed certificate, you effectively have two certificates that can be used interchangeably depending on trust. In a distributed environment, each node could have a trust authority, with which it issues a certificate. That certificate can then be used to verify trust as, if we trust our peer, and they trust another node, we can trust that node.

Along the same lines, having multiple certificates allows for redundancy in a system. If our noe has a certificate issued by Comodo and Digicert, and our system was operating during 2011, once Comodo was removed from trust on the different nodes, the nodes would continue to operate uninterrupted because of the redundant certificate.

This issue is rare, however, with computational power increasing every year, as well as connectivity speeds, it is increasingly difficult to detect a compromise. This leads many systems to rotate certificates on a regular interval so that even if the certificate is compromised, it can’t damage the system for any significant period of time.

Designing a Web API

Where to start

In designing the Web API for TrashTalk, there are many thing I would do different. With that disclaimer, the approach I took was making the best of what I had. I was given a team of 7 students and whatever we knew to work with.

The first step was to understand what the web application needed to do. I met with the founder of this company and reviewed the requirements and put into words a more formal specification. I then chose a chief architect from the team and we set out on creating an architectural document and specification. We designed the initial database and outlined the routes and how the routes should look, then turned to the team and started working.

The first step to any good web API is a clear specification of functionality and resources. Without that, you’ll never know what to build.

Technologies

My dad, a Ph.D. in Computer Science, recently asked me about our technology decisions going into this project. My answer to him as to why we chose Vue and Flask on an Apache server was, “That’s what most people knew.” Our front-end developers all knew Vue, and one was very proficient in setting up a Vue project. Likewise, the back-end developers all knew python, and one had experience setting up a flask project from scratch. I had experience setting up, configuring, and maintaining Apache servers. So the choice was made. Use what you know when you need to move fast.

Now, reflecting on our choices, the technologies employed fit our needs really well. Our Database is currently hosted through AWS using their database tools, and our server is an EC2 virtual server. The plan is to migrate the code to a serverless setup, reducing expenses and increasing performance. Flask applications transfer to serverless systems very easily so the choice to use Flask as an application framework was great!

Vue.js has been less promising. We serve the pages statically, but being a SPA, it’s still hard to get a lot of good SEO on our page, though not too difficult, and it takes a little while to build the site and migrate, but using a separate build server and deploying the static pages only would speed up the deploy process and should be happening already.

Databases and CAP Theorem

We chose a relational database for a couple of reasons:

  1. We can afford to localize data into separate databases per customer or groupings of customers.
  2. RDBMS’s are very fast at filtering data and getting specific pieces of data or joining related resources even without direct id’s.
  3. We knew how to run SQL queries.

Looking at the data and how we use it, a relational database is an ok fit, excluding two use-cases, and we care most about availability. Partitions are easy to handle by having multiple databases and a serverless architecture. Localities can be directed to different regions on AWS and multiple services can cooperate easily without knowing about each other. Availability is achieved by moving serverless and having multiple databases and auto-scaling those. These are known problems.

A note on SQL vs NOSQL

The data we are using is largely relational however, using a NOSQL database might fit better for our use case on getting the container status. If I have a database table for sensor events, telling us fill-level, and containers, telling us where the container is and how big it is, I need to perform a few Left Outer Joins in order to get the resource as we would like it. Previously, we used firebase, and needed to loop over the entire dataset to find which event was the most recent and then use it. This was almost just as bad as what we have moved to. Instead of this, we’ve discussed having a NOSQL database that would store the most recent event for each sensor. This removes the need for Left Outer Joins, and speeds up the process of getting a container’s status, which occurs very frequently.

Another potential use-case is our routes. A route consists of an ID, company information, and an ordered collection of containers. Since order is significant, we want to have an easy way of enforcing that, which could be acheived in a NOSQL database. This has been done in SQL databases before, and works there, but we could simplify this with a NOSQL database.

Resources

This will be covered in my next article.

The Trashcan Problem

The Problem

I currently consult a local startup company that uses IoT sensors to monitor the status of an organizations trashcans and dumpsters. More important than trashcans is likely dumpsters, however, since more money is wasted by collecting an empty dumpster as opposed to an empty trash can 50 feet away from the next.

IoT systems create lots of data, which can be really fun to analyze. I have been doing a lot of data science lately, and I realized there are a lot of features we can add that may help the trashcan situations. Let’s say that I have a nice campus, like Disneyland, and I want to make sure that if someone has trash, they can throw it away. How can I ensure that I have my trashcans in the right place?

Method

We are going to look at several similar existing problems to see if there is already a good way of solving our problem. Likely its been solved, but there’s a chance that it hasn’t been formalized and we will have to adapt an existing problem or algorithm to our situation. The goal here is to fully understand our problem, it’s complexity and it’s simplicity. If we can avoid a complicated algorithm, its best to do that.

Set Cover Problem (NP-Complete)

Immediately, this seems to be an NP-Complete problem. I need to cover another set by adding a trashcan. This scared me at first, however, since Set-cover is a known NP-Complete problem. So with the 100’s of trashcans that Disneyland has, we’re exceeding practical run times. But then I realized that’s not my input-set cardinality, that’s my solution set size. How can we choose a way to represent the sets? Maybe we can triangulate between restaurants and rides to call each significant point a point in our sets and specify a bound of 50 feet to allow a trashcan cover a ride, but that doesn’t take into account the amount of trash or popularity of the area. So I decided that this wouldn’t be the best-fit problem to think of.

Then I found the art gallery problem, which is simply this: Where can I put the minimum number of security guards to monitor every area in a museum. This works well for our problem, since we are an optimization problem, dealing with location and placement, but again, this doesn’t deal directly with popularity, i.e. the Mona Lisa wouldn’t be any more heavily guarded than a child’s hand-turkey in these problems. This problem is also NP-Hard, which means that if we could simplify this problem we might be able to find a good algorithm for it, but thinking of our problem like this will prove to make it inefficient to solve.

K-means Clustering

During my Big Data class, we talked about K-means clustering, which could be a great method of visualizing the data. If we plot the quantity of trash produced as a data-point, with one point for each unit, and the x and y coordinates are simply location, we can create some good clusters to see areas where we are not currently collecting on a map, and given a high density of data points(every trashcan’s data is in the same spot), if we force our algorithm to add a cluster (\(k = k + 1\)) then one of the clusters should divide, right? We will revisit clustering.

Bubble Charts

I bring this up, as this is how I started to visualize the problem. The size of the bubble represents the amount of trash generated in an area, the location of the center of the bubble is the location (lat/lng) of the trashcan. So if I have two distant bubbles that overlap, I need to insert another trashcan in the middle. If I have a bubble that gets too big, I need to add another trashcan nearby to alleviate the load on the other trashcan. The problem here is this isn’t an algorithm or problem, just a visualization. But now we have something we can use set-cover and k-means on. Using a bubble chart, we can plot points proportionally using a gaussian normal distribution and create a dataset that we can then cover with a density based set-cover algorithm or some density-based clustering algorithm.

DBSCAN

This algorithm is great for density-based clustering. DBSCAN will, given an epsilon and min-points, cluster the points accounting for noise, and add clusters as needed. This can be used to analyze how the trash collection is currently, answering the question, “Do I need to add another trashcan?” If the number of clusters produced by a DBSCAN don’t match the actual number of trashcans, we have an abnormal distribution of density in some area. This assumes, however, that you don’t have any trashcans in the same place, or side-by-side.

This algorithm may help us determine if we need to add another trashcan, based on density. But ultimately, that’s over complicating the issue. Why not just look at how fast a trashcan is filling up? That’s not our only question here. We need to know where to place the next trashcan.

Conclusion

Our algorithm will be as follows:

  1. We look at the frequency of collection for all trashcans.
  2. We compare that to pre-established thresholds.
  3. We determine who the “neighbors” are for the trashcan that exceeds the threshold.
  4. We generate random points(Gaussian distribution) using a proportional number of points per trashcan to trash generated per trashcan.
  5. Run K-means on the existing dataset with the three centroids being the current trashcan locations
  6. Add a new centroid and rerun K-means to determine the optimal placement for the containers.
  7. Rerun 3-6 for each trashcan exceeding the threshold.

Analyzing this algorithm we just created, our most complex part is K-means which has a complexity of \(O(k*N*T)\) where T is the number of iteration, k is the number of clusters, and N is the number of samples. This will be repeated for every trashcan, so in worst case, our algorithm will likely be \(O(k^2*N*T)\) Not bad for a set-cover type problem.

This solution ignores infrastructure concerns, such as, “that’s a building, I can’t put a trashcan there” but gives a good solution if we are looking at a campus type location.

The Static Blog

The Static Blog

This is my first attempt at making a static blog. I recently realized that I would have to switch my hosting providers and wanted to find a free way to host a blog and my website. My website was made using static HTML pages and a javascript framework (which was, of course, overkill) which is easy to host from github pages or a similar service. So I changed my DNS entry for my domain and my website was live from github.

The issue came, however, when I decided I wanted to start my own tech blog. Suddenly I had memories of Wordpress come to me and I thought, “hey, maybe a CMS like Wordpress is the best option.” It’s not hard to map a subdomain to another IP and host a site somewhere else, but then I run into the problem that it isn’t free. The problem with a CMS is that you need some sort of back-end management system, whether that’s a MySQL database and templating engine, like wordpress uses, or just a dynamic templating engine.

So I looked on Google how to make a static blog, not using a CMS. I still rely on a templating engine and template1, but since all the pages can be statically generated, I don’t have to worry about having a live back-end system that manages the content. I decided that Jekyll would be the best option, first because Github supports it, but also, it has a large community of support, I can write articles using markdown, and I don’t have to worry about any styling.

My reasoning for blogging is simple: I have experience that others can enjoy as well. I enjoy watching videos on YouTube or reading tech blogs that talk about the cool projects people have worked on, as well as reviews of the latest and greatest technology. Similarly, I enjoy reading the experiences others have in their careers and learning from their experiences.

  1. This template was taken from JekyllDecent by jwillmer on Github