Purdue University: Database Group: Seminar
Database Group, Department of Computer Science

Seminar

2019

We Are Often Working on the Wrong Problem (10 misconceptions about what is important)

  • WHO: Michael Stonebraker @ MIT
  • WHEN: November 5, 2019 (Samuel Conte Distinguished Lecture Series)
  • ABSTRACT: In the DBMS/Data Systems area, many of us seem to have lost our way. This talk discusses 10 different problem areas in which there is considerable current research. Then, I present why I believe much of the work is misguided, either because our assumptions about these problems are incorrect or because we are not paying attention to real users. Topics considered include machine learning (deep and conventional), public blockchain, data warehouses, schema evolution and the cloud.

Data in the Cloud

  • WHO: Raghu Ramakrishnan @ Microsoft
  • WHEN: October 29, 2019 (Samuel Conte Distinguished Lecture Series)
  • ABSTRACT: The cloud has forced a rethinking of database architectures.  Does this offer an opportunity to address the siloed nature of data management systems? The question is especially important given the rise of machine learning and data governance. In this talk, I'll discuss these issues through the lens of the Microsoft data journey, both internal and external.

Transportation Data, Applications & Research for Smart Cities

  • WHO: Cyrus Shahabi @ Univ. of Southern California
  • WHEN: October 25, 2019
  • ABSTRACT: In this talk, I first introduce the Integrated Media Systems Center (IMSC), a data science research center at USC that focuses on data-driven solutions for real-world applications. IMSC is motivated by the need to address fundamental Data Science problems related to applications with major societal impact. Towards this end, I delve into one specific application domain, Transportation, and discuss the design and development of a large-scale transportation data platform and its application to address real-world problems in Smart Cities. I will then cover some of our fundamental research in this area, in particular: 1) traffic forecasting and 2) ride matching.

Scaling Database Systems to High-performance Computers

  • WHO: Spyros Blanas @ Ohio State University
  • WHEN: February 27, 2019
  • ABSTRACT: We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses two challenges for database systems. The first challenge is interoperability with established analytics libraries and tools. Massive datasets often consist of images (arrays) in file formats like FITS and HDF5. We will first present ArrayBridge, an open-source I/O library that allows SciDB, TensorFlow and HDF5-based programs to co-exist in a pipeline without converting between file formats. The second challenge is scalability, as warehouse-scale computers expose communication bottlenecks in foundational data processing operations. We will present GRASP, a parallel aggregation algorithm for high-cardinality aggregation that avoids unscalable all-to-all communication and leverages similarity to complete the aggregation faster than repartitioning. Finally, we will present an RDMA-aware data shuffling algorithm that transmits data up to 4X faster than MPI. We conclude by highlighting additional challenges that need to be overcome to scale database systems to massive computers. (A sketch of the repartitioning baseline for parallel aggregation follows this listing.)
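
As background for the GRASP result mentioned in the abstract, the sketch below simulates the repartitioning baseline that GRASP is said to outperform: every worker routes each group key to an owning worker (the all-to-all exchange that becomes a bottleneck at high cardinality), and each owner then aggregates the rows it receives. This is a single-process Python illustration, not GRASP itself; the (key, value) row layout, the sum aggregate, and the worker count are assumptions made for readability.

    # A single-process simulation of repartition-based parallel aggregation,
    # the baseline that GRASP is compared against in the abstract. Each
    # "worker" holds a list of (key, value) rows; keys are routed to owners
    # by hash and then summed. Illustration only: no real network exchange
    # happens here, and this is not the GRASP algorithm.

    from collections import defaultdict

    def repartition_aggregate(partitions, num_workers):
        """partitions: one list of (key, value) rows per worker."""
        # Exchange phase: every row travels to the worker that owns its key.
        # With high-cardinality keys this is an all-to-all shuffle.
        inboxes = [[] for _ in range(num_workers)]
        for local_rows in partitions:
            for key, value in local_rows:
                inboxes[hash(key) % num_workers].append((key, value))
        # Aggregation phase: each owner sums the values it received per key.
        totals = defaultdict(int)
        for inbox in inboxes:
            for key, value in inbox:
                totals[key] += value
        return dict(totals)

    if __name__ == "__main__":
        partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
        print(repartition_aggregate(partitions, num_workers=2))
        # {'a': 4, 'b': 2, 'c': 4}

The exchange phase here is exactly the all-to-all communication the abstract describes as unscalable; the abstract's claim is that GRASP avoids it by leveraging similarity across partitions.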

2018

How Can (Worst-Case Optimal) Joins Be So Interesting?

  • WHO: Semih Salihoglu @ University of Waterloo
  • WHEN: December 3, 2018
  • ABSTRACT: Worst-case optimality is perhaps the weakest notion of optimality for algorithms. A recent surprising theoretical development in databases has been the realization that the traditional join algorithms, which are based on binary joins, are not even worst-case optimal. Upon this realization, several surprisingly simple join algorithms have been developed that are provably worst-case optimal. Unlike traditional algorithms, which join subsets of tables at a time, worst-case join algorithms perform the join one attribute (or column) at a time. This talk gives an overview of several lines of work that my colleagues and I have been doing on worst-case join algorithms, focusing on their application to subgraph queries. I will cover work from both distributed and serial settings. In the distributed setting, worst-case optimality is a yardstick for two costs of an algorithm: (i) the load, i.e., the amount of data per machine; and (ii) the total communication. Both load and communication complexity trade off against the number of rounds an algorithm runs. I will describe how to achieve worst-case optimality in total communication and the performance of this algorithm on subgraph queries. It is an open theoretical problem to design constant-round algorithms with worst-case optimal load. In the serial setting, I will describe the optimizer of a prototype graph database called Graphflow that we are building at the University of Waterloo. Graphflow's optimizer for subgraph queries mixes worst-case optimal join-style column-at-a-time processing seamlessly with traditional binary joins. (A minimal sketch of attribute-at-a-time joining follows this listing.)
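
The central idea in the abstract is that worst-case optimal algorithms bind one attribute (column) at a time instead of joining one pair of tables at a time. The sketch below illustrates that idea for the triangle query Q(a, b, c) = R(a, b) ⋈ S(b, c) ⋈ T(a, c). It is a toy Python illustration, not Graphflow's optimizer or a worst-case optimality proof; the relation names, the attribute order, and the set-of-tuples representation are assumptions chosen for readability.

    # A toy attribute-at-a-time join for the triangle query
    # Q(a, b, c) = R(a, b) JOIN S(b, c) JOIN T(a, c).
    # Relations are plain sets of 2-tuples; this sketches the idea only.

    def triangle_join(R, S, T):
        """Return all (a, b, c) with (a, b) in R, (b, c) in S, (a, c) in T."""
        results = []
        # Bind attribute a: intersect the a-values offered by R and T.
        for a in {a for a, _ in R} & {a for a, _ in T}:
            # Bind attribute b: values that extend a via R and appear in S.
            for b in {b for a2, b in R if a2 == a} & {b for b, _ in S}:
                # Bind attribute c: values that close the triangle via S and T.
                for c in ({c for b2, c in S if b2 == b} &
                          {c for a2, c in T if a2 == a}):
                    results.append((a, b, c))
        return results

    if __name__ == "__main__":
        edges = {(1, 2), (2, 3), (1, 3), (3, 4)}
        # Using one edge set for R, S and T enumerates directed triangles.
        print(triangle_join(edges, edges, edges))  # [(1, 2, 3)]

Roughly, intersecting the candidates offered by every relation that mentions an attribute is what prevents the intermediate blow-up that binary join plans can suffer on cyclic queries such as this one.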