Efficient and Distributed Generalized Canonical Correlation Analysis for Big Multiview Data

Detailed Description

Main author: Fu, X.
Co-author: Huang, K.
Format: BB
Language: en_US
Published: IEEE Xplore, 2020
Online access: http://tailieuso.tlu.edu.vn/handle/DHTL/9831
Description
Abstract: Generalized canonical correlation analysis (GCCA) integrates information from data samples that are acquired at multiple feature spaces (or 'views') to produce low-dimensional representations, which is an extension of classical two-view CCA. Since the 1960s, (G)CCA has attracted much attention in statistics, machine learning, and data mining because of its importance in data analytics. Despite these efforts, the existing GCCA algorithms have serious complexity issues. The memory and computational complexities of the existing algorithms usually grow as a quadratic and cubic function of the problem dimension (the number of samples/features), respectively; e.g., handling views with 1,000 features using such algorithms already occupies 10^6 memory and the per-iteration complexity is 10^9 flops, which makes it hard to push these methods much further. To circumvent such difficulties, we first propose a GCCA algorithm whose memory and computational costs scale linearly in the problem dimension and the number of nonzero data elements, respectively. Consequently, the proposed algorithm can easily handle very large sparse views whose sample and feature dimensions both exceed 100,000. Our second contribution lies in proposing two distributed algorithms for GCCA, which compute the canonical components of different views in parallel and thus can further reduce the runtime significantly if multiple computing agents are available. We provide detailed convergence analyses of the proposed algorithms and show that all the large-scale GCCA algorithms converge to a Karush-Kuhn-Tucker (KKT) point at least sublinearly. Judiciously designed synthetic and real-data experiments are employed to showcase the effectiveness of the proposed algorithms.