Friday, June 3, 2016

Studying Fake Busses With Science!

My last post was about my experience at this cool hackathon I went to. This post is about what I actually did there. 

All of the code that I used is written in Python and is on my github [Here]. I'd like to think it's pretty readable. You can also run all of the code and reproduce all of my figures in Binder without even having to download anything!


Look, it's that thing we made


For the hackathon we came up with the idea of using the city's public data to improve bus usage in San Diego. We noticed that certain bus routes were overused, others were underused, and some had surge times. During commute times in San Diego people actually wait in line for the bus. They get to the bus stop and, when the bus arrives, some of them don't even get to board because it's already too full. This phenomenon is whack and presents lots of room for improvement. 


Whack bus line somewhere that's not in San Diego


The Plan

Here's what we set out to do:
* Build an app that the city could use to visualize and analyze bus usage data

* Identify overused, underused, and surging bus lines.
* Explore ways we could partner with rideshare companies to make busses great again (sorry)

We learned that the city of San Diego just started installing infrared sensors on bus doors to count how many people get on and off at each stop. This seemed like a great place to start... but we couldn't get our hands on that data over the weekend with 0 hours notice. Crap. 


Well, in neuroscience, if you want to study something that there's no data for, the best thing you can do is build a simulation. And by simulation I mean a big ol' math equation. This math equation outputs realistic data by making certain assumptions about the underlying processes that generate it. Then you can analyze the simulated data while you tweak the parameters to see what happens. 

Here's what I decided were the important features of one day in the life of a bus line:
1. Number of stops
2. Max capacity of the bus
3. How many passengers get on at each stop
4. The stops where the bus fills up
5. Number of surges in the bus line
6. When the surges happen


This seems like a lot of stuff to stuff into a math equation, but it's actually not too bad if we assume bus demand is governed by some combination of A: random noise (riders randomly get on and off at stops irrespective of the time) and B: oscillations (rider demand increases and decreases over time). I whipped up a simulation of this [here], but if that's not your thing then I guess you can just take my word for it.
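
If you just want the gist without clicking through, here's a rough sketch of the recipe (simplified for this post; the parameter names here are mine and not necessarily the ones in the repo):

```python
import numpy as np

def simulate_bus_day(n_stops=100, capacity=40, n_surges=3,
                     noise_level=0.5, demand=1.0, seed=None):
    """Sketch of one day on one bus line: passenger counts at each stop,
    modeled as a mix of oscillations (surges) and random noise,
    clipped at the bus's max capacity."""
    rng = np.random.RandomState(seed)
    stops = np.arange(n_stops)

    # Oscillatory part: n_surges peaks in ridership spread over the day
    surges = (np.sin(2 * np.pi * n_surges * stops / n_stops) + 1) / 2

    # Noisy part: riders getting on and off irrespective of the time
    noise = rng.rand(n_stops)

    # Mix the two and scale by overall demand
    passengers = demand * capacity * ((1 - noise_level) * surges + noise_level * noise)

    # The bus can't hold more than its max capacity
    return np.clip(passengers, 0, capacity)

passengers = simulate_bus_day(noise_level=0.3, demand=1.2, seed=42)
```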



Fake Busses

Using my algorithm I could create daily bus routes and change different things about them. Here are some examples (the blue line is the number of passengers on the bus at each stop; the red dots are where the bus reaches its capacity of 40):


If I make ridership completely random it looks like this:



And if I make ridership completely oscillatory it looks like this:



Both of those look kind of artificial but if I find the happy medium between noise and oscillations it looks a little more realistic.





The previous models all fill up the same number of times. I can also change the overall demand without changing the timing of demand. 

High demand:



Low demand: 




Cool. Now we have a bunch of fake buses... Why did we do that again? 

We made these so that we can use them as templates for figuring out how to analyze them. If we only have the data, and we don't have access to the parameters that generated it, can we infer them? In other words, is there a way that I can give you numbers that characterize the noisiness, 'surginess', and demand of a bus line?



Demand

Demand is easiest so I'll talk about it first. If we want to know the total passengers who got on a bus in a given day, we just have to find the integral of the number of passengers on the route over time. There is a handy function for that called sum().

The more important feature of demand is how many times the bus filled up. If a bus line has lots of demand but never fills up, that's perfect; what I really want to know is when the bus is too full. I did this by setting a threshold at the bus's max capacity and recording the time whenever the number of passengers on the bus reaches that threshold. It looks something like this:


These are two bus routes that both filled up 20 times, but the way they filled up over time is totally different. The blue one filled up randomly, while the red one filled up in bursts. It's obvious to a human which is which, but if we had 1000 bus lines that we didn't want to manually comb through, could we write an algorithm that would automatically tell us whether a bus line is noisy or surgey?
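
For reference, both of those demand measures (total riders and the times the bus filled up) are just a couple of lines on the simulated data from the sketch above:

```python
import numpy as np

capacity = 40
total_demand = passengers.sum()  # rough 'integral' of riders over the whole day

# stop indices where the bus hit max capacity
full_times = np.where(passengers >= capacity)[0]
n_fills = len(full_times)
```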


Noise


Yes we can. Some neuroscientists were faced with a similar problem when analyzing spike times in neurons and figured out that you can quantify 'burstiness' using the coefficient of variation (CV) of the inter-spike intervals.
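
In case you want the formula: CV is just the standard deviation of the intervals between fill-up events divided by their mean. Something like this sketch (not necessarily the exact function in my repo):

```python
import numpy as np

def cv_of_fill_times(full_times):
    """Coefficient of variation of the intervals between fill-up events.
    CV near 0 means the fill-ups are metronome-regular, CV near 1 means
    they look random (Poisson-like), and CV well above 1 means bursty/surgey."""
    intervals = np.diff(np.sort(full_times))
    return intervals.std() / intervals.mean()
```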

Check it out. If I take a purely surging bus line and iteratively add a little bit of noise, CV tracks this change pretty well - before it gets too noisy.



Surges

Cool. Now we can use CV to automatically tell us how noisy and surgey bus lines are. But wouldn't it be helpful to know how many surges there are and when they happen? It would, dear reader, and we can do that pretty well using k-means clustering. If you look through my code you'll see that I tried the traditional approach of finding the optimal k using the elbow method, but I couldn't get that to work so I made up my own thing. 
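
The clustering step itself is plain scikit-learn; the part I made up is how to pick the number of clusters, and for that you'll have to look at the repo. Here's a sketch for a fixed number of surges:

```python
import numpy as np
from sklearn.cluster import KMeans

def find_surge_times(full_times, n_surges):
    """Cluster the fill-up times; the cluster centers approximate the surge times."""
    X = np.asarray(full_times, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_surges, n_init=10, random_state=0).fit(X)
    return np.sort(km.cluster_centers_.ravel())
```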

Here's an example of my method identifying the surge times in a bus line:


It looks like it works but I totally could have lied so here's a little more proof - here's how it does on bus lines with between 1 and 5 surges and 50% noise:
Dope


Summary, Problems, Caveats, and Future Directions

Overall I did a pretty good job. I was able to successfully quantify demand, noisiness, and surges in simulated MTS data. By doing this, I could take large amounts of ambiguous 1's and 0's and turn them into actionable information that someone can use to understand and hopefully improve the system. 

However, there are some important things to acknowledge. First of all, my simulation is definitely still a little artificial. Bus surges are probably not totally sinusoidal, and I don't think surges ever happen at 3:30 AM. This shouldn't really affect the way I analyzed the data, but it makes the simulations seem a little less genuine. Moreover, as the real data comes in, it might turn out that noise and oscillations are not the best way to describe demand. 

CV is also a pretty finicky measure. It is highly sensitive to both the length of time being analyzed and the total number of points where the busses fill up. I didn't think this would happen and I have been using this measure in real science code so I gotta go make sure it really makes sense.

There are much more sophisticated ways to characterize peaks in the presence of noise (somebody might even publish an algorithm that does just this for neural oscillations someday). There are also several other machine learning algorithms one could use to analyze bus ridership that I didn't get to for the sake of time. For instance, longitudinal ridership could be clustered over weeks and months, and we could do outlier detection for unusual surges. 




Feel free to mess around with the code and let me know if I made any mistakes!

Monday, May 23, 2016

My First Hackathon

I went to my first hackathon, The San Diego SmartCity Hackathon, this weekend and it was absolutely incredible. 

If you're interested in the actual project I did, click here and the code for it is here.
This post is more about the actual experience of attending, which I found to be really meaningful.



Here are the takeaways:
  • GO TO A HACKATHON. This entire post will mostly be a demonstration of why you have got to go to one of these things no matter how technically incompetent you think you are.
  • I thought I would be intimidated but the environment was extremely supportive and inclusive
  • Wizards are real and they are at hackathons.
  • I joined a team, made some friends, and contributed to a project that I'm really proud of.
  • We didn't win but it was still super worth it.


I almost didn't go to my first hackathon this weekend.


On Friday afternoon I remembered that I had registered to go to some "hackathon" thing on campus that evening. The keynote speaker was someone I had heard about and I was hoping there would be free snacks and coffee. I was tired from the week and wanted to just go party, but I thought I might as well check it out because I had a ticket. 

If you aren't familiar with the term, a "hackathon" is an event where nerds amass to spend a weekend building tech projects together. I had heard legendary stories of people who conceived of an idea at a hackathon in San Francisco, worked on it all weekend, and got hired to work on it full time before Monday. I didn't think this kind of thing was real.

I had no intention to participate. I just wanted to consume the coffee, snacks, and free entertainment that I was entitled to. I get to the room where this guy is going to speak and there's free beer and tons of hors d'oeuvres. Nice beer and legit hors d'oeuvres too. Nice. I stood in the corner, stuffed my face, and just kind of observed. The room was full of about 100 people, 90 dudes, 125 computers (some desktop boxes that people must have lugged from home), 10 cameras, 20 pieces of technology that I couldn't quite identify, and one big virtual reality setup taking up a corner. I started chatting with some dude as we were all ushered to the auditorium. 

This dude, Rushil, had been to a couple of these things and programs mobile apps for fun. As we walk to the auditorium, some older guy in a suit overhears our conversation about data visualization and starts telling us about his tech company. After a few minutes, he gave us his card and offered us summer internships on the spot. Woah.


Joining the hackathon


The featured speakers talk for a little bit and it's pretty cool. Then the director of the hackathon invites any teams who want to pick up additional members to present their projects. A couple people get up and talk about what they plan to work on. 

The theme of the hackathon is to help San Diego come up with ideas for how to deliver on its Climate Action Plan. It turns out that California has a shit load of environmental problems, believe it or not. The city just released a ton of infrastructure data to the public and was interested in whether these kinds of data could somehow be used to ameliorate some of the city's environmental problems. It also turns out that people who are highly technically competent like to solve these kinds of problems just for fun on a weekend, basically for free.

The project presentations begin. They range from, "We're pretty much an early stage startup trying to build a website to gamify environmentalism and looking for anyone who wants to join", to "Wildfires are a problem. Maybe we can use drones or sensors or something?"

Rushil and I start to talk about what is good and bad about certain ideas. He's all fired up to join a team and win this thing. I'm embarrassingly buzzed from my Stone IPA and entertaining the idea of hovering around a team and checking out what this is all about.

After the presentations, we introduce ourselves to the leaders of the groups we're interested in joining. We talk to the drones/something/wildfires guy, whose name is Victor, and start talking about how we could work on this. Then we start bouncing other ideas off of each other. We come up with some pretty cool projects but realize we have to think about what kinds of projects we could do with our specific skills.

These are our skills:

Rushil: 
* Has completely programmed dozens of mobile apps
* Well-versed in business, economics, and tech entrepreneurship

Victor:
* Experienced freelance full-stack web developer who can build literally anything on a computer
* Made the website for the event
* Is friends with all of the organizers and seems to know the whole SD tech community

Torben:
* Claims to be a "neuroscientist" and "data scientist"
* Has no idea what's going on
* Still kind of buzzed from one beer

Seems like we're all about the same level. Nice.

Hacking


So I guess we're a team now and we head into the coding room to work on ideas together. 

Holy crap. More free stuff. T-shirts, sandwiches, pizza, high-quality granola bars, Red Bull, soda, and chips. I switch from beer to Red Bull and we start to talk about what to work on. 

We try to solve forest fires. Forest fires are complicated and none of us know how to do hardware engineering. Crap. Next we get really excited about making a Fitbit for your house. You'll get points for using less energy and water and producing less waste. Then you can post that shit on Facebook and brag to your friends. Victor and Rushil can build it and I can do stuff with the data. Crap, somebody already made that. Ugh. We think about making an online game that would optimize recycling bin locations and joke about how funny it would be if we could trick people into picking up trash off the ground to get points. Suddenly it's 12:30 AM and we have no solid idea. I'm thinking of quitting my new team. 

Then we think about if there's any way we can use the city's data to improve the bus system and maybe partner public transit with ride-sharing services like Uber or Lyft. I'm not super excited about it but I can't prove to them that it's a dumb idea not worth our time. We head out a little after 1 AM, add each other on Facebook, and message each other til about 2 AM.

I wake up to my phone alarm at 7 AM on Saturday. We had planned on meeting at 8 AM.
"What am I doing awake right now? Am I really going to go back to this thing? I don't even really know these people, our idea is super fucking dumb, and we came up with it in the middle of the night."
I somehow convince myself to get up and head back.

When I get there, I see people who seem like they had been there all night. There are people programming the crap out of some complicated looking stuff and all we have is a weak idea. Somehow the magic happens and we start to get this project rolling. Rushil and Victor start building the backend of a website we can use to visualize bus usage data. I start modeling bus use data and coming up with analyses we use to characterize it. We absolutely crush work for most of the day, not even stopping for lunch.

We meet again in the morning and put on some finishing touches. The 'finishing touches' take way longer than we expect. We frantically finish up our project and barely make the deadline - coding until the second the judges arrive to evaluate us. 


Presenting the project



Our final project kicks ass. It's a dashboard that a city controller can use to visualize bus usage on all of the city's bus routes using the new passenger counter sensors they're adding to busses. We can algorithmically label high-load and surging bus routes (post about this, code I wrote to do this). We have economic value models explaining how a partnership with Lyft could help with bus use surges to make public transit more reliable. The app looks nice.


The first judge comes over. Rushil gives the elevator pitch. Victor shows off his web platform. I begin to speak at length about the caveats of modeling surges in bus demand. I'm cut off quickly. I didn't even get to brag about the clever math I used because we only have 4 minutes total to present. This is not a science conference.

Our pitch becomes better with each judge and by the end we sound pretty darn smart.

While the judges deliberate, I look at a couple of the projects and they're unreal. I was completely in awe of what these people could make in a weekend. One group was using sensors to optimize water efficiency when watering plants with hydroponics. One group had the idea of displaying the city's water supply on the treasury building using LEDs so the building would look like a giant, kind of empty, cup of water.

The finalists were announced. Unfortunately, we were not chosen but the ones that were chosen were pretty awesome. Remember when I said we were joking on Friday night about making a game out of throwing trash away? The group with the virtual reality setup made a game where you throw objects either into the trash or recycling in virtual reality. They got picked. It actually looked pretty rad. Damn it. One group built a low-cost sensor that could be put in forests to detect forest fires and instantly communicate with the fire department. These sensors could also be repurposed to help 911 identify your location when you call far from a cell tower. They were not even finalists. 
One group received an honorable mention and was publicly offered a job with the city.

We didn't move on to the next round, but we received a lot of positive feedback about our project and we talked to a few people about what our next steps could be. We'll see what happens.

Go to a hackathon


Before I got there I was worried that I would be laughed out of the room for not knowing anything about tech. I was accepted with open arms and I found that I had a niche set of skills that were necessary to make our project work. Not many engineers have the skills that scientists do! Moreover, many projects were in dire need of non-techy skilled individuals like design specialists and people who could come up with projects and keep them on track.

Every person there seemed to be genuinely interested in helping other people out. Staff members and participants were frequently helping out with each other's projects. I heard the following conversation dozens of times: "Oh you're trying to do this? I know all about this. You should use that! Here's how you do it...(technical jargon for 5 minutes to half-an-hour)"

This was one of the best events I've ever been to. The breadth and diversity of technical knowledge in this small room could rival the entire Annual Society for Neuroscience Conference. The encouraging, friendly, and positive atmosphere could probably outdo most music festivals. I found myself in a group with two new friends and we built something in a day that I couldn't have done in a year on my own. 

I got a whole lot more out of this experience than the snacks, coffee, and lecture that I came for.

Friday, March 11, 2016

Trying Electrophysiological Analyses on Stock Market Data

In this post I'm going to analyze data from the stock market using some of the techniques that I typically use to understand brain data. I can't think of a very good reason for why this makes sense or why it should work... but I'm going to do it anyways and you can't tell me what to do.
What you get when you google "crazy stock market people"
A lot of the neuroscience analyses that I do involve analyzing how a signal changes over time. Specifically I look at what voltage does inside or next to brains. Basically, I investigate how different parts of the brain interact with each other based on how their local voltages change over time.

This led me to think about how different stocks in the stock market might interact. Like voltage recordings in a brain, stock prices move up and down over time and they are spread out over a wide range of industries in the economy. It also seems pretty likely that they interact with each other in interesting ways that I might be able to characterize using certain analyses. Let's see if I can do that.

Like always, my code for these analyses is on my github here: https://github.com/torbenator/ephys-for-stocks

Getting historical stock market data:

I found this helpful website for how to algorithmically download historical stock market data here: http://www.quantshare.com/sa-43-10-ways-to-download-historical-stock-quotes-data-for-free
Unfortunately, I couldn't get any of the techniques that they described to work except for one, so that was nice.

First problem: What data do I analyze?


There are lots of stocks and I don't know anything about stocks. But I saw a commercial for this thing called Sector SPDRs http://www.sectorspdr.com/sectorspdr/ when I was watching lacrosse once. From what I understand, this company organizes stocks into categories as a way of diversifying your portfolio. This seems like a reasonable way to split the data into a stock market version of "functional networks". I picked 5 random stocks from each category as a small sample to analyze. I could have picked more but I got tired of googling their tickers and wanted to get going. Of the 50 I tried to download I got 41. Certain stocks had different names in the past year, which made my parsing method fail. Also, I could only download 1 year of data using my scraping method. Such is data science.
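
If you're curious what the wrangling looks like once a download method works, it's roughly this (a sketch that assumes a folder of per-ticker CSVs with 'Date' and 'Close' columns, which may not be exactly what your data source calls them):

```python
import glob
import pandas as pd

closes = {}
for path in glob.glob("data/*.csv"):
    ticker = path.split("/")[-1].replace(".csv", "")
    df = pd.read_csv(path, parse_dates=["Date"], index_col="Date")
    closes[ticker] = df["Close"]

# rows = trading days, columns = tickers
prices = pd.DataFrame(closes).sort_index()
```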


It's time for a picture. Here's what the data looks like:


First Analysis: How do stocks correlate with each other? 
i.e. when one stock moves up, does another one reliably move down?

If we correlate all the data we get this:

From this visualization I can see that most stocks correlate positively with each other. This means that most stocks tend to move together - when one goes up, the other goes up as well. But, for some reason, some stocks seem to have really negative correlations across the board.
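
The correlation step itself is basically one line of pandas. Here's a sketch using the prices DataFrame from above; whether you correlate raw prices or day-to-day returns changes the numbers a bit, so check the repo for what I actually ran:

```python
corr = prices.pct_change().corr()  # or prices.corr() for raw prices

# each stock's average correlation with every other stock
# (dropping the 1.0 self-correlation on the diagonal)
mean_corr = (corr.sum() - 1) / (len(corr) - 1)
mean_corr = mean_corr.sort_values()

print(mean_corr.tail(5))  # best indicators of the overall sample
print(mean_corr.head(5))  # the stocks that move against everything else
```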


Here's another (better) way of visualizing the same data but squeezing the y-axis down:
Side-note for Python users: there are some ridiculous colors built into matplotlib (http://stackoverflow.com/questions/22408237/named-colors-in-matplotlib). Colors I used in this plot include 'lemonchiffon', 'peachpuff', and 'papayawhip'.

When I sort the stocks by their mean correlation with all of the others, I can find the ones that best track the overall sample.
The five stocks with the highest mean correlations are (in order):

Dentsply, Republic Services, Facebook, Nasdaq, and Northrop Grumman

Because they correlate the highest with all other stocks in my sample, they are the best indicators of general stock market activity. In other words, these stocks are the best at speaking for the general trend in the sample. Something like the Nasdaq makes sense here because it's a reflection of a huge chunk of the economy, but who would have thought that Dentsply, a dental prosthetics company, would be one of the best indicators of market activity in this sample?

Conversely - what's up with those stocks that correlate negatively with everything?

The five stocks with the lowest mean correlations are (in order):

Nisource, Exxon, Newmont Mining, Cigna, and Phillips 66

I remember my finance friends telling me that gas prices were generally indicative of a growing economy; when people buy more gas it means they're building, manufacturing, and traveling. But this analysis shows exactly the opposite. When stock in gas companies goes up, values of most other stocks go down. Weird!
Perhaps this is due to my tiny sample size of stocks and extremely short time window. I bet with more data this would get cleaned up. Moving on.



Next question: What are the relationships between stocks within categories?
i.e. are technology stocks all highly correlated with each other while materials stocks aren't?



This figure is pretty similar to the last one but you'll notice that the correlations are generally higher.

Utilities, Financial Services and Energy were by far the least consistent, while Technology, Consumer Discretionary, and Real Estate were the most consistent. Nothing too crazy here but I guess it's pretty neat.


Now onto some ~fancier~ analyses:


What does the power spectrum of stock data look like?

Spectral analyses are things that nerds use to understand periodic signals. And they're fucking rad. If you want a brief explanation of spectral analyses, go here: 
https://github.com/voytekresearch/tutorials/blob/master/Power%20Spectral%20Density%20and%20Sampling%20Tutorial.ipynb
This section might not make much sense if you aren't familiar with frequency domains.

The power spectra of one year of all my stock data look like this:
Power in the signal is distributed towards the low frequency oscillations and has a 1/f slope. (Look up 1/f slopes if you don't know what those are. They are very cool). However there is also a spike at the high end. This means that all of the stocks in my sample generally change over large time windows, but day to day fluctuations are also very important. The fact that these peaks are on either end of my frequency domain means that there is lots of information at scales that I don't have access to. Clearly a lot goes on within each day to influence a stock's value but all of that information gets clumped into the tail here. Likewise, fluctuations of a stock's value over a multi-decade window would provide better estimates of power at larger time windows.
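
If you want to compute these spectra yourself, they come straight out of scipy. A sketch (since the data are daily closes, the frequencies are in cycles per trading day):

```python
import numpy as np
from scipy import signal

spectra = {}
for ticker in prices.columns:
    x = prices[ticker].dropna().values
    x = (x - x.mean()) / x.std()  # normalize so different stocks are comparable
    # fs = 1 sample per trading day; nperseg trades frequency resolution for noise
    freqs, psd = signal.welch(x, fs=1.0, nperseg=min(len(x), 128))
    spectra[ticker] = (freqs, psd)

# freqs are in cycles per day, so 1/freqs gives the time window in trading days
```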

Personally, I can conceptualize this observation better with the following figure where I (unorthodoxly for neuroscience) flipped the x axis. To me this makes these time windows a little more intuitive.

After looking at the spectra of all of the data, it's clear that there is a lot of variation in the 1/f slope. Some stocks are flatter and some are steeper. This has real meaning because it tells us how much of a stock's change in value occurs over different ranges of time. Steeper slopes indicate that a stock is a more stable investment, while flatter slopes indicate that it is more volatile. I happen to have worked on an algorithm that automatically calculates 1/f slope in these kinds of data in neuroscience, so I ran it on this because I guess that's the point of this post.
Exponents of 1/f slopes in stock data 


Here we can clearly see that stocks have a wide range of slopes. Furthermore, there seem to be trends within certain stock categories. Consumer Discretionary stocks tend towards steeper slopes, indicating stability, while Utility stocks seem to have flatter slopes, indicating volatility.
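
The slope fit itself is nothing fancy: a straight line through the spectrum in log-log space. This is the simplest possible version, not necessarily what the algorithm I used does:

```python
import numpy as np

def one_over_f_exponent(freqs, psd, fmin=None, fmax=None):
    """Fit power ~ 1/f^alpha with a line in log-log space and return alpha.
    Bigger alpha = steeper slope = more of the variance at long time scales."""
    mask = freqs > 0
    if fmin is not None:
        mask &= freqs >= fmin
    if fmax is not None:
        mask &= freqs <= fmax
    slope, _ = np.polyfit(np.log10(freqs[mask]), np.log10(psd[mask]), 1)
    return -slope

# one exponent per stock, using the spectra computed above
slopes = {t: one_over_f_exponent(f, p) for t, (f, p) in spectra.items()}
```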

Phase amplitude coupling in the stock market

Phase amplitude coupling is an analysis you can do on electrophysiological data to see how oscillations of different frequencies interact. Specifically, you can see how the phase of a slow oscillation influences the amplitude of a fast oscillation. (see https://github.com/voytekresearch/tutorials/blob/master/Phase%20Amplitude%20Coupling%20Tutorial.ipynb) 
We have pretty good reasons to believe why this makes sense to study in brains. I have no idea why this should make any sense for stock market data but it makes pretty-looking figures. Here is the mean comodulogram (calculating phase amplitude coupling for many phases and amplitudes) of all of the stocks:


Interestingly, there seems to be some phase amplitude coupling in the stock market! The amplitude of 2-3 month oscillations interacts really strongly with the phase of three-quarter-year oscillations. I have no explanation for why this would happen in reality. It probably happened because I duplicated (and left-right flipped) the data in order to have enough samples to run this analysis. I don't believe this graph, but I think it looks pretty cool.
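
For completeness, here's roughly how a single coupling value gets computed: pull out a slow oscillation and take its phase, pull out a fast oscillation and take its amplitude envelope, then ask whether the amplitude prefers particular phases. This sketch uses a mean-vector-length style measure; the tutorial linked above covers the real details:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def bandpass(x, lo, hi, fs=1.0, order=2):
    """Butterworth bandpass between lo and hi (in cycles per sample)."""
    sos = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band", output="sos")
    return sosfiltfilt(sos, x)

def pac_mvl(x, phase_band, amp_band, fs=1.0):
    """Mean-vector-length style phase-amplitude coupling (sketch)."""
    phase = np.angle(hilbert(bandpass(x, *phase_band, fs=fs)))
    amp = np.abs(hilbert(bandpass(x, *amp_band, fs=fs)))
    return np.abs(np.mean(amp * np.exp(1j * phase))) / amp.mean()

# e.g. phase of ~9-month swings vs amplitude of ~2-3 month swings (cycles per trading day):
# pac = pac_mvl(prices["FB"].dropna().values, phase_band=(1/250, 1/120), amp_band=(1/80, 1/40))
```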


Conclusions and Future Directions

Do not take anything I did here seriously. Please. I do not claim to be a digital signal processing, financial, or even neuroscience expert. I ran a handful of hacky analyses on an extremely limited amount of data. This was mostly a proof of concept to see if I could find anything at all. That being said, there's something here and these analyses could be helpful jumping-off points. One year of data of only 41 stocks simply isn't enough to characterize much but it's clear that we can use neuroscience inspired analyses to identify:

  • Stocks that are markers or outliers of general market or category activity 
  • Consistency among stocks within and between categories
  • The stability of a stock's value over time using spectral analyses


I am interested in pursuing this a little further. Does anyone know how to get more historical stock data without paying? Does anyone reading this know anything about finance and would you be willing to educate me on how to improve the nonsense that I ran here?



Again all data and scripts that I used to run this analysis are on my github here: https://github.com/torbenator/ephys-for-stocks