How to Download TED Subtitles

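TED is a series of talks that also works well as English listening practice. The thing is, I care more about the content of the talks than about improving my English, so I still prefer to watch with Chinese subtitles. I used to download subtitles through http://tedtalksubtitledownload.appspot.com/, but TED has redesigned its website, so that tool no longer works. I therefore decided to build my own TED subtitle downloader (and get some Python practice along the way).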


import urllib.request

def parseTime( line ):
    # Extract the start time (in milliseconds) from the data-time attribute.
    time = line.split("data-time='")[1].split("'")[0]
    return time

def parseSub( line ):
    # Extract the subtitle text between the first ">" and the next "<".
    subtitle = line.split(">")[1].split("<")[0]
    return subtitle

def time2Str( time ):
    # Convert milliseconds to the SRT time format, e.g. 00:02:52,184
    millisecond = time % 1000
    second = ( time // 1000 ) % 60
    minute = ( time // 60000 ) % 60
    hour = time // 3600000
    return "%02d:%02d:%02d,%03d" % ( hour, minute, second, millisecond )

def getSubtitleList( link ):
    
    if not link.endswith( '/' ):
        link = link + "/"
    url = link + "transcript"
    
    subtitleList = []
    flag = 0
    
    web = urllib.request.urlopen( url )
    
    for raw in web.readlines():
        # urlopen returns bytes in Python 3; the page is assumed to be UTF-8.
        line = raw.decode( "utf-8", "replace" )
        # The start of transcript options
        if "talk-transcript__language" in line:
            flag = 1
            continue
        # The end of transcript options
        if ( "</select>" in line ) and ( flag == 1 ):
            break
        # Parse options
        # Example: <option value='zh-cn'>Chinese, Simplified</option>
        if ( flag == 1 ) and ( "<option" in line ):
            lang = line.split(">")[1].split("<")[0]
            abbr = line.split("'")[1]
            subtitleList.append( ( abbr, lang ) )
    
    web.close()
    return subtitleList

def genSubtitle( link, lang ):

    if not link.endswith( '/' ):
        link = link + "/"
    url = link + "transcript?lang=" + lang
    
    web = urllib.request.urlopen( url )
    
    times = []
    subtitles = []
    
    for raw in web.readlines():
        # urlopen returns bytes in Python 3; the page is assumed to be UTF-8.
        line = raw.decode( "utf-8", "replace" )
        if "talk-transcript__fragment" in line:
            time = parseTime( line )
            subtitle = parseSub( line )
            # 12 seconds is for the video opening
            times.append( int( time ) + 12000 )
            subtitles.append( subtitle )
    
    web.close()
    
    size = len( subtitles )
    
    # Nothing to print if the page contained no transcript fragments.
    if size == 0:
        return
    
    # Since the last subtitle has no end time, we set 5 seconds directly.
    times.append( times[ size - 1 ] + 5000 )
    
    # SRT format example:
    # 45
    # 00:02:52,184 --> 00:02:53,617
    # Hello World
    
    for i in range( size ):
        print( i + 1 )
        print( time2Str( times[i] ) + " --> " + time2Str( times[i+1] ) )
        print( subtitles[i] )
        print()
    
def main():
    link = "http://www.ted.com/talks/charmian_gooch_meet_global_corruption_s_hidden_players/"
    
    # List the available subtitle languages as ( abbreviation, name ) pairs.
    options = getSubtitleList( link )
    for option in options:
        print( option )
    
    genSubtitle( link, "en" )

if __name__ == "__main__":
    main()
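The SRT assembly at the end of genSubtitle can also be tried offline, without fetching a TED page. The following is a minimal sketch, assuming the start times (in milliseconds) and the subtitle texts have already been extracted. `format_srt_time` and `build_srt` are hypothetical helper names that do not appear in the script above, the sample data is made up, and the 5-second duration for the last cue simply mirrors the script's own choice.

```python
def format_srt_time(ms):
    """Format a millisecond count as an SRT timestamp (HH:MM:SS,mmm)."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def build_srt(start_times, texts, last_duration_ms=5000):
    """Return SRT text for cues given as two parallel lists.

    Each cue ends where the next one starts. The last cue has no
    successor, so it is shown for last_duration_ms (an assumption).
    """
    if not texts:
        return ""
    ends = list(start_times[1:]) + [start_times[-1] + last_duration_ms]
    blocks = []
    for index, (start, end, text) in enumerate(zip(start_times, ends, texts), 1):
        blocks.append("%d\n%s --> %s\n%s\n" % (
            index, format_srt_time(start), format_srt_time(end), text))
    # Cues are separated by a blank line.
    return "\n".join(blocks)


# Hypothetical sample data, not taken from a real talk.
sample = build_srt([12000, 15500], ["Hello", "World"])
print(sample)
```

Returning a string instead of printing makes it easy to write the result to a `.srt` file or to check it in a test.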
Obviously there is no user interface yet. Maybe I will turn this into a web service later, like the tool it replaces.
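Short of a web service, a command-line interface might be a smaller first step. This is only a sketch using the standard `argparse` module. The function name `parse_args`, the `--lang` and `--list` options, and the example URL are all assumptions made for illustration, not part of the script above.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments for a hypothetical subtitle CLI."""
    parser = argparse.ArgumentParser(
        description="Download a TED talk transcript as SRT.")
    parser.add_argument("link", help="URL of the TED talk page")
    parser.add_argument("--lang", default="en",
                        help="language code such as 'en' or 'zh-cn'")
    parser.add_argument("--list", action="store_true",
                        help="list available languages instead of downloading")
    return parser.parse_args(argv)


# Example with an explicit argument list (hypothetical URL, no network access).
args = parse_args(["http://www.ted.com/talks/example_talk", "--lang", "zh-cn"])
print(args.link, args.lang, args.list)
```

In a full script, `args.list` would decide whether to call getSubtitleList or genSubtitle, and `args.lang` would be passed on to genSubtitle.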
