6.7960 | Fall 2024 | Undergraduate, Graduate

Deep Learning

Project Ideas

Themes | Project Titles | Potential Questions
Inductive bias Data-Induced Constraints Versus Model-Induced Structural Inductive Bias

How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization

When is one preferred over the other? Is one strictly better than the other?

Should My Model Be Deep or Wide? Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth
Positional Embeddings in Coordinate Networks Compare different positional embeddings, such as periodic and non-periodic ones, when reconstructing 2D and 3D signals with coordinate networks.
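As a concrete starting point, here is a minimal sketch (PyTorch; the frequency count, MLP width, and variable names are illustrative choices, not from the course) contrasting a periodic Fourier-feature embedding with feeding the raw, non-periodic coordinates into the same MLP:

    # Periodic (Fourier-feature) vs. non-periodic (raw coordinate) inputs for a
    # coordinate network that regresses pixel values from (x, y) in [-1, 1]^2.
    import math
    import torch
    import torch.nn as nn

    class FourierEmbedding(nn.Module):
        """gamma(x) = [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0, ..., L-1."""
        def __init__(self, num_freqs: int = 8):
            super().__init__()
            self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * math.pi)

        def forward(self, coords):                      # coords: (N, d)
            proj = coords[..., None] * self.freqs       # (N, d, L)
            emb = torch.cat([proj.sin(), proj.cos()], dim=-1)
            return emb.flatten(start_dim=-2)            # (N, 2 * d * L)

    def coordinate_mlp(in_dim, hidden=256, out_dim=3):
        return nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    # Periodic variant: embed the 2D coordinates before the MLP.
    net_periodic = nn.Sequential(FourierEmbedding(8), coordinate_mlp(in_dim=2 * 2 * 8))
    # Non-periodic baseline: feed raw (x, y) coordinates directly.
    net_raw = coordinate_mlp(in_dim=2)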
Initialization How Initialization Affects a Learnable Gaussian Kernel Classifier Investigate how various initialization strategies affect the performance of a simple classifier built from learnable Gaussian kernels.
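One possible baseline, as a minimal sketch (PyTorch): a classifier whose features are learnable Gaussian kernels, with two of the initialization strategies one could compare (random centers vs. centers sampled from the training inputs; here `data` is assumed to be a tensor of training inputs):

    import torch
    import torch.nn as nn

    class GaussianKernelClassifier(nn.Module):
        """logits = Linear(k(x)), where k_j(x) = exp(-||x - c_j||^2 / (2 * sigma_j^2))."""
        def __init__(self, in_dim, num_kernels, num_classes, init="normal", data=None):
            super().__init__()
            if init == "normal":
                centers = torch.randn(num_kernels, in_dim)
            elif init == "data":
                idx = torch.randperm(len(data))[:num_kernels]   # centers sampled from training inputs
                centers = data[idx].clone()
            else:
                raise ValueError(init)
            self.centers = nn.Parameter(centers)
            self.log_sigma = nn.Parameter(torch.zeros(num_kernels))  # bandwidth init: sigma = 1
            self.linear = nn.Linear(num_kernels, num_classes)

        def forward(self, x):                                   # x: (N, in_dim)
            d2 = torch.cdist(x, self.centers).pow(2)            # (N, num_kernels)
            k = torch.exp(-d2 / (2 * self.log_sigma.exp() ** 2))
            return self.linear(k)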
Optimization Implicit Bias of SGD: Double-Edged Sword?

Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks

Symmetries in Optimization Algorithms

Recent theoretical papers suggest that having a symmetric optimization algorithm hurts learnability, by appealing to hardness results in theoretical computer science. Design experiments to test this hypothesis at scale by either:

  • designing asymmetric learning algorithms, or
  • comparing the same gradient-based descent algorithm under settings where it is symmetric and settings where it is not.

How do hyperparameter settings (depth/width/…) affect this hypothesis?

Sharpness Aware Minimization Recent papers in deep learning theory have suggested that SGD is implicitly biased towards finding flat local minima and that these flat minima generalize better than sharper ones. Design experiments to test this hypothesis under different optimizers, different architectures, and different data regimes (dataset size, dataset distributions, linear separability, geometric representations, etc.), and find conditions where this hypothesis is easy to explain and/or where it breaks.
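For reference, a minimal sketch (PyTorch) of one SAM update under the usual two-step formulation: perturb the weights toward the approximate worst case within an L2 ball of radius rho, then apply the gradient computed there. The helper name and rho value are illustrative:

    import torch

    def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
        # 1) Gradient at the current weights.
        loss = loss_fn(model(x), y)
        loss.backward()
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        # 2) Ascend to the (approximate) worst case: w + rho * g / ||g||.
        eps = []
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is None:
                    eps.append(None)
                    continue
                e = rho * p.grad / (grad_norm + 1e-12)
                p.add_(e)
                eps.append(e)
        model.zero_grad()
        # 3) Gradient at the perturbed weights; undo the perturbation, then step.
        loss_fn(model(x), y).backward()
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                if e is not None:
                    p.sub_(e)
        base_opt.step()
        base_opt.zero_grad()
        return loss.item()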
  Steepest Descent On Problem Set 2 you worked out various forms of steepest descent. See if you can derive other forms using other norms. Can you come up with a principled way to select the norm? Or, empirically test which method works best. See if you can beat the SOTA on speedrunning CIFAR-10 or NanoGPT (currently held by steepest descent under the spectral norm).
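For the empirical comparison, a minimal sketch (PyTorch) of per-parameter steepest-descent directions under three norms; the spectral case uses an SVD, which is one convenient (if slow) way to compute that direction:

    import torch

    def steepest_direction(grad: torch.Tensor, norm: str) -> torch.Tensor:
        if norm == "l2":          # normalized gradient descent
            return -grad / (grad.norm() + 1e-12)
        if norm == "linf":        # sign descent
            return -grad.sign()
        if norm == "spectral":    # for a matrix gradient G = U S V^T, move along -U V^T
            assert grad.ndim == 2, "spectral steepest descent applies to matrix-shaped parameters"
            U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
            return -(U @ Vh)
        raise ValueError(norm)

    # Update rule, applied per parameter: p <- p + lr * steepest_direction(p.grad, norm)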
Regularization Spectral Regularization How can we use the spectral perspective on deep learning (presented in class) to understand regularization and design better regularizers? Can we understand the effect of weight decay on the spectral properties of individual layers, and convert this into an understanding of weight decay on the network as a whole? How does the effect of weight decay depend on the network architecture: e.g., deep MLPs versus residual networks. Are there better ways to regularize than weight decay, based on the spectral perspective? Is there any interesting connection between spectral regularization and parameter count?
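As one concrete regularizer to experiment with, a minimal sketch (PyTorch) that penalizes the spectral norm (top singular value) of every linear layer; the coefficient and the choice of layers to penalize are illustrative, not a recommendation:

    import torch
    import torch.nn as nn

    def spectral_penalty(model: nn.Module) -> torch.Tensor:
        penalty = torch.zeros((), device=next(model.parameters()).device)
        for m in model.modules():
            if isinstance(m, nn.Linear):
                # Largest singular value of the weight matrix (differentiable through the SVD).
                penalty = penalty + torch.linalg.matrix_norm(m.weight, ord=2)
        return penalty

    # Inside a training step, compare against plain weight decay:
    #   loss = task_loss + 1e-3 * spectral_penalty(model)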
Phenomena Biases in Vision/Language Models Joint vision/language models such as CLIP try to align vision and language latent spaces. This provides an extra level of visibility into the representations: for example, for a given image of a cat, its similarity to the text embedding of “a photo of a cat” typically captures how “cat-like” the image is. This project would involve studying the representation space of such models with respect to sensitive attributes/biases. For example, given photos of either men or women, which image embeddings are closer to the caption “a photo of a firefighter.” This project would involve performing a systematic study to identify biases in the representations of such models.
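A minimal measurement sketch, assuming the Hugging Face transformers CLIP interface and the openai/clip-vit-base-patch32 checkpoint (both are assumptions, not requirements of the project): score two sets of photos against the same occupation caption and compare the score distributions.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def caption_similarity(image_paths, caption):
        images = [Image.open(p) for p in image_paths]
        inputs = processor(text=[caption], images=images, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        return out.logits_per_image.squeeze(-1)   # scaled image-text cosine similarities

    # e.g., compare score distributions for the same caption across two groups of photos:
    # sims_a = caption_similarity(photos_of_group_a, "a photo of a firefighter")
    # sims_b = caption_similarity(photos_of_group_b, "a photo of a firefighter")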
Grokking: Why? When?
Double Descent: Why? When? Are certain hyperparameter choices necessary to observe such a phenomenon?

Self-Supervised Learning for Time Series Anomaly Detection How can self-supervised learning be applied to detect anomalies in time series data with minimal labeled data? What architectures and pretext tasks are most effective in learning useful representations for anomaly detection in domains like industrial monitoring or healthcare?
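One possible pretext task, as a minimal sketch (PyTorch): mask random timesteps of a window and train a small network to reconstruct them; the reconstruction error then serves as an anomaly score. The architecture and masking ratio are illustrative:

    import torch
    import torch.nn as nn

    class MaskedReconstructor(nn.Module):
        def __init__(self, window_len, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(window_len, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, window_len),
            )

        def forward(self, x):                        # x: (batch, window_len)
            return self.net(x)

    def pretext_loss(model, x, mask_ratio=0.25):
        mask = (torch.rand_like(x) < mask_ratio).float()
        recon = model(x * (1 - mask))                # zero out the masked timesteps
        # Score only the masked positions, so the model must infer them from context.
        return ((recon - x) ** 2 * mask).sum() / mask.sum().clamp(min=1)

    @torch.no_grad()
    def anomaly_score(model, x):
        return ((model(x) - x) ** 2).mean(dim=-1)    # higher = more anomalous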
Understanding Linear Mode Connectivity

On linear mode connectivity:

How can we explain linear mode connectivity: is it a dataset-specific or a model-specific phenomenon? What does this tell us about the optimization landscape of these problems?
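A minimal experiment sketch (PyTorch): evaluate the loss along the straight line between the weights of two trained networks; a flat curve (no barrier) is the signature of linear mode connectivity. The helper assumes the two models share an architecture:

    import copy
    import torch

    @torch.no_grad()
    def loss_along_path(model_a, model_b, loss_fn, loader, num_points=11):
        sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
        losses = []
        for alpha in torch.linspace(0.0, 1.0, num_points):
            model = copy.deepcopy(model_a)
            model.load_state_dict({k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a})
            model.eval()
            # Note: BatchNorm running statistics should ideally be re-estimated for each alpha.
            total, n = 0.0, 0
            for x, y in loader:
                total += loss_fn(model(x), y).item() * len(x)
                n += len(x)
            losses.append(total / n)
        return losses    # plot against alpha; a bump in the middle is the loss barrier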

Theory of Transformers How do Transformer Models Implement Recurrence?

Without any explicit mechanism for maintaining a recurrent state, transformer models outperform recurrent neural networks. Why?

Increasing Context Length for Transformers Given the quadratic computation cost of transformers, increasing the context length is an important problem. This project would involve surveying current approaches for increasing the context length (e.g., attending only to the last K tokens, or attending to logarithmically spaced tokens) and possibly coming up with your own scheme.
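As a reference point for the survey, a minimal sketch (PyTorch) of the sliding-window variant, where each query attends only to the last K tokens, reducing the cost from O(T^2) to O(T*K):

    import torch

    def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
        i = torch.arange(seq_len)[:, None]    # query positions
        j = torch.arange(seq_len)[None, :]    # key positions
        # Allowed iff the key is not in the future and lies within the last `window` tokens.
        return (j <= i) & (j > i - window)    # (seq_len, seq_len) boolean mask

    # Usage with scaled dot-product attention: mask out disallowed scores before the softmax,
    #   scores = scores.masked_fill(~sliding_window_mask(T, K), float("-inf"))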
Can Transformers Learn Graph Connectivity? Transformers have shown impressive behavior on complex tasks in NLP such as machine translation, semantic parsing, and code generation. At the same time, they often fail on simple tasks such as arithmetic. There have been a few recent works taking a closer look at the ability of transformers to learn tasks such as arithmetic, GCD computations, and matrix operations [1–4], which has provided a bit of insight into what these transformers have been learning.
  Spatial Locality in Transformers Why do tokens in deep layers of transformers (in particular ViTs) show spatial locality, where the token that came from the top-left patch in the input image still encodes information about the top-left content? What leads to this effect? What is the role of positional encoding? Can this effect be removed?
Learning Algorithms (Unsupervised Learning) Contrastive Time-Series Representation Learning

Traditional time series approaches, which predominantly focus on prediction tasks, lead to “black-box” predictions. Recent literature has explored using contrastive learning to learn time-series representations, but none has explored learning the underlying system parameters. Could a contrastive learning approach be used to learn the underlying system parameters?
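One way to set this up, as a minimal sketch (PyTorch): treat two windows drawn from the same simulated trajectory (i.e., the same underlying system parameters) as a positive pair and train with an InfoNCE loss; afterwards, probe the embeddings with a small regressor that predicts the known parameters.

    import torch
    import torch.nn.functional as F

    def info_nce(z_a, z_b, temperature=0.1):
        """z_a, z_b: (batch, dim) embeddings of two windows per trajectory, row-aligned."""
        z_a = F.normalize(z_a, dim=-1)
        z_b = F.normalize(z_b, dim=-1)
        logits = z_a @ z_b.t() / temperature    # (batch, batch) cosine similarities
        targets = torch.arange(len(z_a))        # positives sit on the diagonal
        return F.cross_entropy(logits, targets)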

Learning a Good Dimensionality Reduction from Data Dimensionality reduction plays a key role in large-scale data analysis, both in theory and practice. Theoretically, the two most used tools are random projection matrices (“worst-case” guarantees) and PCA (“average-case” guarantees). In practice, dimensionality reduction is used for applications such as clustering, data summarization, and storage. In many applications, a linear map is also useful (such as in data streams or distributed computing). The goal of the project is to see if one can learn a good dimensionality reduction algorithm, such as a linear map, from the data itself for downstream applications. For example, can we write a fixed downstream application, such as clustering, as a differentiable function, and then train a good dimensionality reduction map for the task on past/training data? How can we interpolate the learned solution with prior theoretical constructions to retain some worst-case guarantees for future inputs?
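A minimal sketch (PyTorch) of the simplest version: learn a linear map P that preserves pairwise distances on the training data, then optionally interpolate it with a fixed random (Johnson-Lindenstrauss-style) matrix to hedge toward worst-case behavior. All hyperparameters here are illustrative:

    import torch

    def train_projection(X, k, steps=2000, lr=1e-2, batch=256):
        n, d = X.shape
        P = torch.randn(d, k, requires_grad=True)
        opt = torch.optim.Adam([P], lr=lr)
        for _ in range(steps):
            idx = torch.randint(n, (batch, 2))
            diff = X[idx[:, 0]] - X[idx[:, 1]]          # random pairs of training points
            loss = ((diff @ P).norm(dim=1) - diff.norm(dim=1)).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return P.detach()

    # Interpolate with a random projection; lam trades data fit for worst-case robustness:
    # P_mixed = (1 - lam) * train_projection(X, k) + lam * torch.randn(X.shape[1], k) / k ** 0.5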
  Why Do MAEs Learn Better Features than AEs? See slide 62 of lecture 12.
  Prediction-Powered Data Augmentation

In statistics, prediction-powered inference is a novel method to improve statistical inference by supplementing data with predictions generated by a machine learning model. A possible extension of this idea is to feed generated predictions back into the model as a form of semi-supervised data augmentation. What deep learning architectures could facilitate this? Does it provide any advantages over traditional methods, and how can we balance dataset size against the validity of the generated data?
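A minimal sketch (PyTorch) of one round of this feedback loop as confidence-thresholded pseudo-labeling; the threshold and the decision to retrain are illustrative choices:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def pseudo_label(model, x_unlabeled, threshold=0.95):
        probs = F.softmax(model(x_unlabeled), dim=-1)
        conf, labels = probs.max(dim=-1)
        keep = conf >= threshold                 # only keep high-confidence predictions
        return x_unlabeled[keep], labels[keep]

    # x_aug, y_aug = pseudo_label(model, x_unlabeled)
    # Retrain on the union of (x_labeled, y_labeled) and (x_aug, y_aug), and track held-out
    # metrics to check whether the extra "data" helps or merely reinforces the model's own errors.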

  Platonic Representation Hypothesis

Does the Platonic representation hypothesis hold for modalities beyond vision and language? Does it hold for generative models? Does it hold for reinforcement learning? What are its limitations? Do autoencoders learn the PMI kernel? Can you use kernel alignment to translate between modalities? How does the kernel evolve through layers? How does it evolve through training? (Note: these are some of the main research questions in Phil’s lab, so I’m biased, but if you make progress on them I will be happy.)
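For the kernel-alignment questions, a minimal sketch (PyTorch) of linear centered kernel alignment (CKA) between two models' representations of the same inputs; feature extraction and batching are left out:

    import torch

    def linear_cka(X, Y):
        """X: (n, d1), Y: (n, d2) -- representations of the same n inputs from two models."""
        X = X - X.mean(dim=0, keepdim=True)
        Y = Y - Y.mean(dim=0, keepdim=True)
        hsic = (X.t() @ Y).pow(2).sum()              # ||Y^T X||_F^2
        return hsic / ((X.t() @ X).pow(2).sum().sqrt() * (Y.t() @ Y).pow(2).sum().sqrt())

    # Compute this for, e.g., paired image/caption embeddings to test cross-modal alignment,
    # or across layers and training checkpoints to track how the kernel evolves.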

Learning Algorithms (Reinforcement Learning) Representing Disagreement in Learning-to-Rank In ranking data such as that used for reinforcement learning from human feedback, we would like to obtain accurate representations of disagreement between rankers. Learning-to-rank (L2R) tries to learn a (single) ranking between different feature mappings. This project aims to generalize L2R to settings where the rankings of some alternatives come from a small set of different underlying rankings. Factorization/low-rank techniques could be a good starting point.
Large Language Models Reducing the # of Queries to Large Models Large models, such as large language or vision models, are typically used to obtain embeddings of data, such as text or images. The embeddings are very rich and encode semantic information about the objects. The embeddings can later be used for tasks such as similarity search. However, the cost (both monetary and environmental) of obtaining the embeddings can be large. Given a dataset, can we query the model at “very few points” and later extrapolate to embeddings for other data without querying the large model again?
Personalizing Large Language Models

Recent work draws inspiration from writing education to personalize text generation. What are some methods that can improve the quality of personalized responses? Alternatively, what are better evaluation metrics for personalized tasks, and how are these metrics different from other automatic evaluations?

Social Reasoning in Large Language Models

Prior works have introduced benchmarks for social commonsense reasoning. What are underlying themes in social errors that large language models make? Are there methods that could potentially address these errors?

  Hallucinations in Large Language Models One major disadvantage of traditional gradient descent algorithms is that loss functions punish uncertainty too harshly. One major consequence is that large language models often “hallucinate,” that is, confidently produce incorrect or nonsensical outputs. Is there a way to quantify this effect, and how can we modify transformer architectures to calibrate for uncertainty?
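A minimal sketch (PyTorch) of one way to quantify the calibration side of this: expected calibration error (ECE) over a model's predicted class (or next-token) probabilities; the bin count is illustrative:

    import torch

    def expected_calibration_error(probs, labels, n_bins=15):
        """probs: (N, C) softmax outputs; labels: (N,) integer targets."""
        conf, pred = probs.max(dim=-1)
        correct = (pred == labels).float()
        bins = torch.linspace(0, 1, n_bins + 1)
        ece = torch.zeros(())
        for lo, hi in zip(bins[:-1], bins[1:]):
            in_bin = (conf > lo) & (conf <= hi)
            if in_bin.any():
                # |accuracy - mean confidence| in the bin, weighted by the bin's share of samples.
                ece += in_bin.float().mean() * (correct[in_bin].mean() - conf[in_bin].mean()).abs()
        return ece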
  Pitfalls of LLM Preference Ranking

Holistic evaluation of LLM-generated text is difficult, as automated metrics are hard to define and evaluation by humans is costly and slow. Recent work uses LLMs to rank the outputs of other LLMs, but such evaluations can be prone to unexpected biases. Are there additional biases or error modes of LLM-based ranking? What are explanations for how these biases arise? How can we mitigate them?

Misc. A Deeper Look into Equivariance for Molecular Data Conduct a comparative analysis of an SE(3) (rotation and translation) equivariant GNN and a non-equivariant GNN in molecular data tasks with a focus on investigating the interpretability of latent geometry within the two GNNs.
Benefits and Disadvantages of LSTM and Transformer Architectures for Time-Series Forecasting Study time-series forecasting from an experimental standpoint through a comparative analysis of LSTMs and transformers. In the past few years, ever since the ramp-up of transformers, there has been a great debate about the usefulness of transformers in time-series forecasting applications such as stock prediction, traffic management, healthcare forecasting, etc. Researchers attempt to show that their Autoformer model performs better than the baseline DLinear model on a traffic dataset. Is it expected that transformers always perform better than LSTMs, or does it depend on the type of application?
Can GNNs Predict Trustworthiness?  
Exploring Generative Models in Time Series, from the Quality of Representations Learned to How Model Choices Affect the Statistical Properties of a Generated Series Can we use representation learning to segment a time series into regimes? Can representation learning be used as a form of automatic feature generation for forecasting? How does data preprocessing affect the representation learned (in the frame of contrastive learning, how would the choice of positive samples and of settings like window size affect the representation)? There is a paper that explores using GANs for generating time-series data for asset prices. The model is able to capture many of the statistical properties desired in asset return series through preprocessing and perhaps model choices. If those model choices were to be changed, how would this affect the desired properties?
Embeddings for Spatiotemporal Forecasting  
Representation Learning for Recommender Systems
Augmenting Limited and Noisy Datasets Using Denoising VAEs to Enhance Training for Downstream Tasks Can we use denoising variational autoencoders (DVAEs) to effectively learn robust feature representations from limited and noisy datasets? And can synthetic data generated by these DVAEs help improve performance on downstream tasks?
Can Well-Designed Sensors Compensate for Simple Action Policies in Embodied Agents?

A blind agent might have to perform complex information-gathering actions to localize itself and navigate in an environment, whereas an agent with a camera might have a more straightforward navigation policy. Is there a relationship between the complexity (e.g., depth of a neural network) of an action policy and how much information a sensor receives? (Probably yes, but it would be nice to make a concrete analysis.)

Learning Other Vision Modalities from Embodied Vision Tasks Can an embodied agent tasked solely to find an object in a maze also do depth prediction? From past research, we know that models designed for different learning modalities share commonalities. It would be interesting to see if the latent space of an RL agent tasked to do embodied RGB object detection, for example, can learn depth.
Learning a Robot’s Belief-Space Representation Belief space planning algorithms seek to generate information-gathering actions based on an agent’s “knowledge” of where it is in its environment. For example, a particle filter might produce an (estimated) belief over a robot’s location. Traditional belief space planning approaches use linear programming techniques to generate such actions. Can we instead learn the belief space of a robot and use this to create the most informative actions for robot navigation? Will this perform better for more complex navigation/manipulation tasks?
  Conformal Prediction in Deep Learning For many applications (such as medical diagnostics and forecasting in environmental/climate science), it is often more useful for a model to produce a set of potential output values that likely contains the true value, rather than a single most likely estimate. This technique is called conformal prediction. Can conformal prediction be adapted to provide adaptive, valid prediction intervals for deep learning models? How does conformal prediction perform when applied to large-scale architectures, and can we reduce over-conservative intervals in high-certainty regions?
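A minimal sketch (PyTorch) of split conformal prediction for a regression model: a held-out calibration set turns any point predictor into intervals with marginal 1 - alpha coverage.

    import math
    import torch

    @torch.no_grad()
    def conformal_interval(model, x_calib, y_calib, x_test, alpha=0.1):
        # 1) Nonconformity scores on the calibration set (absolute residuals).
        scores = (model(x_calib).squeeze(-1) - y_calib).abs()
        n = len(scores)
        # 2) Conformal quantile with the finite-sample correction.
        q_level = min(1.0, math.ceil((n + 1) * (1 - alpha)) / n)
        q = torch.quantile(scores, q_level)
        # 3) Symmetric intervals around the test predictions.
        pred = model(x_test).squeeze(-1)
        return pred - q, pred + q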
Probabilistic Deep Learning Frameworks Can we design a new probabilistic deep learning framework that extends traditional architectures (e.g., CNNs, RNNs) to predict probability distributions over outputs? What are the advantages of this approach in terms of uncertainty quantification and model interpretability?
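A minimal sketch (PyTorch) of the simplest instance: a head that predicts a Gaussian (mean and log-variance) per output and is trained with the negative log-likelihood; the same head can sit on top of a CNN or RNN backbone.

    import torch
    import torch.nn as nn

    class GaussianHead(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.mean = nn.Linear(in_dim, out_dim)
            self.log_var = nn.Linear(in_dim, out_dim)   # predict log-variance for stability

        def forward(self, h):
            return self.mean(h), self.log_var(h)

    def gaussian_nll(mean, log_var, target):
        # 0.5 * [log var + (target - mean)^2 / var], up to an additive constant.
        return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()

    # mu, log_var = GaussianHead(backbone_dim, 1)(features)
    # loss = gaussian_nll(mu, log_var, y)   # log_var.exp() is the predicted per-sample variance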
