Alex Gajewski
alex at alexgajewski dot me, @apagajewski
Right now I'm the CEO of SF Compute, where we're working on automating the scaling of neural networks. I generally think AI will turn out best if there are as few barriers as possible to making a state-of-the-art model, and if there are lots and lots of companies training large models of different types.
Before that, I founded Metaphor, where we trained a big contrastive model over a billion pages on the internet to make a neural search engine. I also helped out with the first batch of AI Grant.
(If you're interested in funding an open source GPT-4, let me know!)
Here are some things I think would be cool to build (let me know if you'd like to build any of them :D):
- a more abstract programming language for deep learning
- most languages are designed to make things like binary operations short, but now we mostly care about matrix/vector operations
- it doesn't take a lot of words to tell someone the main idea of a deep learning paper, but it sure takes a lot of code
- can you build a compiler to go from an abstract description to the specific details?
- e.g. you usually don't care about the details of a u-net architecture, you just want to set the number of parameters and for the compiler to make reasonable choices for the rest
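As a toy sketch of what "state a parameter budget, let the compiler pick the rest" could look like (the `configure_unet` name, the doubling-channels heuristic, and the parameter-count formula are all made up for illustration, not a real cost model):

```python
def configure_unet(param_budget: int, depth: int = 4) -> dict:
    """Toy 'compiler' pass: given only a parameter budget, pick
    channel widths for a u-net-style model. Channels double at each
    level; the cost model (two 3x3 convs per level) is a crude
    stand-in for whatever a real compiler would use."""
    # Params per level scale like 2 * 3*3 * (base * 2**l)**2, so the
    # total is quadratic in the base width; solve for the base.
    per_unit = sum(2 * 9 * (2 ** l) ** 2 for l in range(depth))
    base = max(8, int((param_budget / per_unit) ** 0.5))
    return {
        "depth": depth,
        "widths": [base * 2 ** l for l in range(depth)],
        "approx_params": per_unit * base ** 2,
    }

print(configure_unet(10_000_000))
```

The point is the interface: the user states intent (a parameter budget) and everything else is a default the compiler is free to choose.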
- AlphaZero for math, theorem proving with no human data
- in 2015, Go was too hard of a problem for then-current deep RL methods on their own, so AlphaGo had to combine them with MCTS and self-play to solve it
- today, theorem proving is in a similar place, so it's a good problem to sharpen our methods against
- AI characters that authors can put months/years of work into
- right now, the only input you provide to an AI character is a prompt, which limits how interesting the characters can end up being
- you should be able to put way more information into them, either via fine-tuning or just longer contexts
- maybe they should be goal-directed with RL, trying to get you to say a certain thing or to move the story in a certain direction
- really these are just a new form of literature
- voice to voice models that think in sound
- right now people are doing voice with a Whisper->GPT-4->ElevenLabs pipeline
- really it should be one big end-to-end model, so that the part that does the thinking knows how you said something
- you want it to be able to interrupt you, talk over you, etc.
- write a history book automatically
- it feels like language models are almost at the point that they could do the research, organize it, and put it into book form
- it's possible you want to fine-tune the whole process on actual history books with RL, so it learns how to do the latent research behind existing books
- easier if you have models with super long context windows
- language models with true billion token context windows
- feels like it's probably possible
- you need to store 1B keys and values in GPU memory; this is about 16TB with hidden dim 4096, which easily fits if you're training on 2048 A100s
- then you just need to figure out a way to select which ones to attend over for any given token that doesn't cost too much computation
- also need datasets with sequences 1B tokens long (a book is maybe 200k tokens)
- could construct these sequences by concatenating a bunch of related documents together, so the model has an incentive to find the parts of the context that are most useful for its current prediction
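The memory arithmetic above, written out. This assumes fp16 and one key/value pair per token, which is what the 16TB figure implies; a full multi-layer KV cache would multiply this by the layer count:

```python
# Back-of-the-envelope check on the 16TB figure above.
tokens = 1_000_000_000   # 1B-token context
hidden = 4096            # hidden dimension
bytes_per_value = 2      # fp16
kv = 2                   # one key vector + one value vector per token

total_bytes = tokens * hidden * bytes_per_value * kv
print(total_bytes / 1e12, "TB")  # ~16.4 TB

# Sharded across 2048 A100s (80 GB each):
per_gpu_gb = total_bytes / 2048 / 1e9
print(per_gpu_gb, "GB per GPU")  # 8 GB, comfortably under 80 GB
```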
- language models with some kind of tree search?
- lots of people are trying to build variants of language models that do some kind of tree search in the solution space of a specific problem, usually programming or math
- I think it would be cool to try to do this in a very general way for arbitrary language modeling, where you just have some kind of learned value function that's trying to predict whether the generated sequence is real or not
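A minimal sketch of that search structure, with toy stand-ins for both the language model and the learned value function (everything here is fabricated for illustration; a real version would use actual model log-probs and a trained value head scoring whether the text looks real):

```python
VOCAB = ["a", "b", "c"]

def lm_logprobs(seq):
    # stand-in "language model": mildly prefers repeating the last token
    last = seq[-1] if seq else "a"
    return {t: (0.0 if t == last else -1.0) for t in VOCAB}

def value(seq):
    # stand-in "learned value function": penalizes implausible
    # (here: repetitive) partial sequences
    runs = sum(1 for i in range(1, len(seq)) if seq[i] == seq[i - 1])
    return -float(runs)

def tree_search(steps=5, beam=2):
    frontier = [([], 0.0)]  # (partial sequence, log-prob so far)
    for _ in range(steps):
        expanded = [
            (seq + [tok], lp + tok_lp)
            for seq, lp in frontier
            for tok, tok_lp in lm_logprobs(seq).items()
        ]
        # rank candidates by log-prob plus the value estimate, keep the best
        expanded.sort(key=lambda c: c[1] + value(c[0]), reverse=True)
        frontier = expanded[:beam]
    return frontier[0][0]

print(tree_search())
```

The interesting part is that the value function is generic: it scores any partial sequence, not solutions to one specific problem.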
- end-to-end chip design
- basically an optimization problem, just one that's sort of difficult to write down
- you want to be able to do it as one big RL task, where the model is just trying to make some distribution of programs run as fast as possible
- might even want it to output a gate layout directly
- general purpose robotics
- robotics is getting very good!
- this was trained in simulation and then transferred zero-shot to the real world
- general approach might be to put a ton of effort into making a super realistic simulator with thousands of different pretraining tasks, and scaling up a lot
- you might also need to make thousands of physical robots to train on if the simulator isn't good enough
- may help to pretrain on the internet, probably language and video
- AGI self-driving
- if we want self-driving cars to be able to go anywhere in any conditions, we need them to be able to generalize
- probably the thing to do is to pretrain on giant text and video datasets, and then specialize them to driving
- people who work on self-driving say you'll never get the level of reliability you need that way, but I think you probably can; someone just needs to try it
- (I don't think Cruise/Waymo are doing this, so it could be a startup)
- a big predictive medical model
- train on all the sequences of patient records, scans, tests, treatments, and outcomes
- this is effectively a very big offline RL dataset
- then, like a decision transformer, you can predict which treatment will lead to the best outcome for any given patient
- the hardest part is probably getting access to all the medical records, but if you can coordinate everyone, you could probably do a lot better than human doctors
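A toy version of the "predict which treatment leads to the best outcome" step, with made-up records and simple per-group averaging standing in for a real sequence model over full patient histories:

```python
from collections import defaultdict

# Fabricated offline dataset of (patient condition, treatment, outcome).
# A real version would be sequences of records, scans, tests, and
# treatments, modeled by a large transformer.
offline_data = [
    ("high_bp", "drug_a", 0.6),
    ("high_bp", "drug_b", 0.8),
    ("high_bp", "drug_b", 0.7),
    ("diabetic", "drug_a", 0.9),
]

stats = defaultdict(list)
for patient, treatment, outcome in offline_data:
    stats[(patient, treatment)].append(outcome)

def best_treatment(patient):
    # pick the treatment with the best mean predicted outcome
    options = {t: sum(o) / len(o) for (p, t), o in stats.items() if p == patient}
    return max(options, key=options.get)

print(best_treatment("high_bp"))  # drug_b, mean outcome 0.75 vs 0.6
```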
- are there methods that look very different from deep learning but have the same general shape?
- neural networks are basically just a class of universal function approximators plus a way of optimizing them (SGD)
- the basic building block is just alternating layers of linear and nonlinear functions
- you could imagine applying SGD to different classes of differentiable functions that don't look like MLPs
- evolutionary algorithms are maybe another class of optimizer, but they're usually pretty inefficient
- you would think there would be lots of classes of functions and optimizers that can scale, and neural networks aren't necessarily the most compute efficient
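A tiny illustration of the "different function class, same optimizer" point: gradient descent applied to a two-parameter sinusoid rather than an MLP, with hand-derived gradients (the target function and hyperparameters are chosen arbitrarily):

```python
import numpy as np

# Fit f(x) = a * sin(w * x) -- a differentiable function class that
# isn't a neural net -- to samples of 1.5 * sin(1.2 * x).
x = np.linspace(-2.0, 2.0, 200)
y = 1.5 * np.sin(1.2 * x)

a, w = 1.0, 1.0  # initial guess
lr = 0.05
for _ in range(2000):
    err = a * np.sin(w * x) - y
    # gradients of the mean squared error, written out by hand
    grad_a = np.mean(2.0 * err * np.sin(w * x))
    grad_w = np.mean(2.0 * err * a * x * np.cos(w * x))
    a -= lr * grad_a
    w -= lr * grad_w

loss = float(np.mean((a * np.sin(w * x) - y) ** 2))
print(a, w, loss)
```

Same recipe as deep learning (a differentiable family plus a gradient-based optimizer), just a different family; the open question is which families actually scale.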