BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//nyc2024.pydata.org//cfp//79HA87
BEGIN:VTIMEZONE
TZID:US/Eastern
BEGIN:STANDARD
DTSTART:20001029T020000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10;UNTIL=20061029T060000Z
TZNAME:EST
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
END:STANDARD
BEGIN:STANDARD
DTSTART:20071104T020000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=11
TZNAME:EST
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000402T020000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=4;UNTIL=20060402T070000Z
TZNAME:EDT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
END:DAYLIGHT
BEGIN:DAYLIGHT
DTSTART:20070311T020000
RRULE:FREQ=YEARLY;BYDAY=2SU;BYMONTH=3
TZNAME:EDT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-cfp-AWLTZP@nyc2024.pydata.org
DTSTART;TZID=US/Eastern:20241106T151000
DTEND;TZID=US/Eastern:20241106T164000
DESCRIPTION:Preparing data for LLM pretraining is one of the most
  challenging and time-consuming tasks. Data for pretraining is usually
  scraped from the internet\, which is full of duplicates and undesired
  content such as hate\, abuse\, and profanity.\n\nTo produce a quality
  model\, the collected data needs to go through a series of
  transformations that improve its quality\, add attributes (e.g. detect
  language)\, make it safe (e.g. remove spam)\, and put it in a common
  format expected by the training module.\n\nData Prep Kit
  (https://github.com/IBM/data-prep-kit)\, aka DPK\, is an open source
  project (created by IBM) to transform the input data collected from
  the internet (https://commoncrawl.org/) into data ready for
  training.\n\nThis hands-on tutorial will be a working session to
  understand the data preparation steps for LLMs\, how Data Prep Kit
  (DPK) works\, how to run DPK transforms in stages\, how to scale data
  processing\, and how to build a new transform.
DTSTAMP:20250709T215038Z
LOCATION:Central Park West
SUMMARY:Preparing Data for LLM pretraining using open source Data Prep Kit 
 - Santosh Borse
URL:https://nyc2024.pydata.org/cfp/talk/AWLTZP/
END:VEVENT
BEGIN:VEVENT
UID:pretalx-cfp-X8NXE3@nyc2024.pydata.org
DTSTART;TZID=US/Eastern:20241108T114000
DTEND;TZID=US/Eastern:20241108T122000
DESCRIPTION:Preparing data for LLM pretraining is one of the most
  challenging and time-consuming tasks. Data for pretraining is usually
  scraped from the internet\, which is full of duplicates and undesired
  content such as hate\, abuse\, and profanity.\n\nTo produce a quality
  model\, the collected data needs to go through a series of
  transformations that improve its quality\, add attributes (e.g. detect
  language)\, make it safe (e.g. remove spam)\, and put it in a common
  format expected by the training module.\n\nData Prep Kit
  (https://github.com/IBM/data-prep-kit)\, aka DPK\, is an open source
  project (created by IBM) to transform the input data collected from
  the internet (https://commoncrawl.org/) into data ready for training.
DTSTAMP:20250709T215038Z
LOCATION:Music Box
SUMMARY:Preparing data for LLM training with Data Prep Kit - Santosh Borse
URL:https://nyc2024.pydata.org/cfp/talk/X8NXE3/
END:VEVENT
END:VCALENDAR